From Noise to Control: Parameterized Diffusion Policies

Summary

This paper introduces Parameterized Diffusion Policy (PDP), a framework that makes diffusion policies controllable by conditioning on low-dimensional latent parameters, enabling smooth behavior interpolation and adaptation without retraining. It demonstrates improved performance on complex multimodal robot tasks in simulation and real-world experiments.

arXiv:2606.00336v1 Announce Type: new Abstract: We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous parameters embedded in a learned behavior manifold. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors.


# From Noise to Control: Parameterized Diffusion Policies
Source: [https://arxiv.org/html/2606.00336](https://arxiv.org/html/2606.00336)
Haotian Fu, Mingxi Jia, George Konidaris, Yilun Du, Bruno Castro da Silva

###### Abstract

We propose Parameterized Diffusion Policy \(PDP\), a framework for learning diffusion policies conditioned on low\-dimensional, continuous parameters embedded in a learned behavior manifold\. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering\. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights\. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real\-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors\.

Machine Learning, ICML

## 1 Introduction

Real-world robot tasks are often *underspecified*: a given goal in a particular environment may be achieved by many distinct action sequences (Ivanovic et al., [2020](https://arxiv.org/html/2606.00336#bib.bib22); Lynch et al., [2020](https://arxiv.org/html/2606.00336#bib.bib35)). A soccer-playing robot, for example, may score via multiple feasible shot trajectories that all satisfy “ball enters goal” but differ in how they avoid defenders and satisfy constraints (Figure [1](https://arxiv.org/html/2606.00336#S1.F1)). This one-to-many structure is pervasive in manipulation and navigation (Mandlekar et al., [2021](https://arxiv.org/html/2606.00336#bib.bib38); Florence et al., [2022](https://arxiv.org/html/2606.00336#bib.bib13); Chen et al., [2023](https://arxiv.org/html/2606.00336#bib.bib3)). Consequently, expert datasets provided for robot policy learning are naturally *multimodal*, and an ideal policy is expected to both (i) incorporate such multimodal behaviors and (ii) find the *right* behavior to execute when constraints change (da Silva et al., [2012](https://arxiv.org/html/2606.00336#bib.bib5); Dalal et al., [2021](https://arxiv.org/html/2606.00336#bib.bib6)).

![Refer to caption](https://arxiv.org/html/2606.00336v1/x1.png)

Figure 1: Observation-side shift vs. constraint-induced behavior shift, and why PDP enables stable behavior steering. Top: Standard evaluations vary observations while the intended behavior remains unchanged. Middle: We study constraint-induced behavior shifts, where environmental changes invalidate many of the trajectory modes in the training dataset, and success requires selecting or discovering a different trajectory mode. Bottom: In standard DP, steering via the sampling noise is ill-conditioned: small perturbations can lead to large, unpredictable trajectory changes. PDP, by contrast, exposes a learned behavior latent $z$ whose geometry is aligned with trajectory similarity, so small changes in $z$ induce smooth changes in behavior, enabling interpolation and adaptation.

Diffusion policies have recently become a strong default for behavior cloning because they can model multimodal and high-dimensional action sequences and produce diverse rollouts via stochastic denoising (Chi et al., [2023](https://arxiv.org/html/2606.00336#bib.bib4)). However, diversity alone is not the same as *controllability*. Most standard evaluations focus on *observation-side* distribution shift (Figure [1](https://arxiv.org/html/2606.00336#S1.F1), top) (Zhu et al., [2023](https://arxiv.org/html/2606.00336#bib.bib67); Wolf et al., [2025](https://arxiv.org/html/2606.00336#bib.bib59)), i.e., ensuring that changes in appearance or camera conditions do not affect the physically intended behavior. In contrast, many deployment failures arise from *constraint-induced behavior shifts* (Figure [1](https://arxiv.org/html/2606.00336#S1.F1), middle): a new obstacle, contact constraint, or feasibility condition can invalidate previously demonstrated strategies, forcing the robot to select or discover a novel behavior mode beyond those covered in the training dataset (Konidaris & Barto, [2009](https://arxiv.org/html/2606.00336#bib.bib28); Pan et al., [2025a](https://arxiv.org/html/2606.00336#bib.bib43); Wagenmaker et al., [2025](https://arxiv.org/html/2606.00336#bib.bib56)).

This paper targets the following setting: a diffusion policy is pretrained offline on multimodal demonstrations. At deployment time, the task remains the same, but *constraints shift* so that only one or a few of the behavior modes covered in the training dataset remain viable. In harder cases, none of the observed modes may satisfy the new constraints, requiring the agent to interpolate a new behavior mode. Crucially, we want adaptation to be *fast* and *data-light*, potentially guided by only a single new demonstration. This motivates two requirements that current diffusion-policy control interfaces do not satisfy:

(1) Fast behavior steering via parameter updates. When constraints unseen in training invalidate existing strategies, we would like to steer the policy at test time by adapting only a *small set of parameters*, rather than retraining the full policy. In other words, adaptation should require only solving a compact optimization problem, or taking a few gradient steps, to produce a feasible behavior that reliably satisfies the new constraints.

(2) A smooth, geometry-aligned behavior manifold for generalization. Beyond selecting among memorized modes, the policy should generalize *within* the training distribution by interpolating between behaviors to produce plausible trajectories that were not explicitly demonstrated. This calls for a behavior space whose geometry is aligned with trajectory similarity: moving a small distance in this space should induce a correspondingly small, predictable change in the executed trajectory. Such a space would support both mode interpolation (to improve coverage) and rapid adaptation to new constraints (to preserve feasibility).

We propose Parameterized Diffusion Policies (PDP): diffusion policies conditioned on a *continuous* behavior parameter $z$ learned from demonstrations. PDP learns a trajectory encoder that maps each demonstration to a latent code and optimizes the resulting representation so that it is *geometry-aligned*: distances in $z$ reflect semantic similarity between physical trajectories. Conditioning the diffusion denoiser on $z$ converts diffusion from “noise-driven diversity” into an explicit and optimizable control mechanism: $z$ becomes a low-dimensional handle for (i) selecting a desired mode, (ii) smoothly interpolating between behaviors to produce unobserved but in-distribution trajectories, and (iii) rapidly adapting to constraint shifts by optimizing $z$ at test time, without updating policy weights. We evaluate PDP in a benchmark suite designed specifically to isolate behavior-side adaptation under constraint shifts with multimodal training data. Across multiple simulation tasks and a real robot task, we find that PDP enables more reliable steering and adaptation than other diffusion policy and behavior cloning baselines.

## 2 Related Work

##### Diffusion Models and Latent Structure\.

Diffusion models have become a dominant class of generative models because they can represent complex, multimodal distributions, but their latent spaces are often poorly structured and their control signals can be difficult to optimize\. This limits their ability to smoothly interpolate and control outputs in both vision and sequential decision\-making settings\(Sohl\-Dickstein et al\.,[2015](https://arxiv.org/html/2606.00336#bib.bib51); Song & Ermon,[2019](https://arxiv.org/html/2606.00336#bib.bib52); Ho et al\.,[2020](https://arxiv.org/html/2606.00336#bib.bib20); Dhariwal & Nichol,[2021](https://arxiv.org/html/2606.00336#bib.bib7); Park et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib45); Pan et al\.,[2025b](https://arxiv.org/html/2606.00336#bib.bib44); Hahm et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib18); Moser et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib41); Karras et al\.,[2022](https://arxiv.org/html/2606.00336#bib.bib27); Nichol & Dhariwal,[2021](https://arxiv.org/html/2606.00336#bib.bib42); Song et al\.,[2021](https://arxiv.org/html/2606.00336#bib.bib53); Li et al\.,[2025a](https://arxiv.org/html/2606.00336#bib.bib30),[b](https://arxiv.org/html/2606.00336#bib.bib31)\)\. Recent analyses have highlighted geometric distortions in diffusion latent spaces and proposed disentangled or isometric latent learning to improve interpolation and editability\(Park et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib45); Pan et al\.,[2025b](https://arxiv.org/html/2606.00336#bib.bib44); Hahm et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib18)\)\. These limitations are even more consequential for control and decision tasks, where precise behavior steering is essential for safety and feasibility\. 
Consequently, recent surveys have framed such limitations as key opportunities for diffusion in robotics\(Zhu et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib67); Xu et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib61); Wolf et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib59); Makarova et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib37)\), while other works have explored RL\-guided training to align generative behaviors with objectives beyond simple likelihood\(Black et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib1); Miao et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib40); Wang et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib58)\)\.

##### Diffusion Policies and Behavior Steering\.

Building on generative diffusion models, diffusion policies have emerged as a powerful paradigm for robot imitation\(Chi et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib4); Wang et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib57); Li et al\.,[2024b](https://arxiv.org/html/2606.00336#bib.bib33); Jackson et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib23); Janner et al\.,[2022](https://arxiv.org/html/2606.00336#bib.bib24); Høeg et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib21)\)\. While expressive, these policies often require test\-time steering to adapt to novel constraints without retraining\(Du et al\.,[2023b](https://arxiv.org/html/2606.00336#bib.bib12); Yu et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib63)\)\. Current strategies typically steer a frozen policy by injecting external guidance, such as value\-function gradients, human feedback, or specialized geometric priors\(Ding et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib8); Li et al\.,[2025c](https://arxiv.org/html/2606.00336#bib.bib34); Shan et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib50); Du et al\.,[2023a](https://arxiv.org/html/2606.00336#bib.bib11); Du & Mordatch,[2019](https://arxiv.org/html/2606.00336#bib.bib10); Zhu et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib66)\)\. Other methods optimize directly in the high\-dimensional noise space to refine behaviors\(Wagenmaker et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib56); Li et al\.,[2024a](https://arxiv.org/html/2606.00336#bib.bib32); Kang et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib26); Ding et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib9)\)\. 
However, this optimization is often ill\-conditioned due to the lack of geometric regularization in the latent noise\(Park et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib45); Pan et al\.,[2025b](https://arxiv.org/html/2606.00336#bib.bib44); Hahm et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib18)\)\. To improve controllability, a parallel line of work learns explicit representations, often through discrete bottlenecks such as skill tokens or vector\-quantized codes\(Chen et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib3); Lee et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib29); Wu et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib60); Qiao et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib47); Ma et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib36)\)\. While these representations facilitate mode selection, they naturally restrict control to a finite behavior set\.

##### Parameterized Skills and Actions\.

Similar high\-level ideas about parameterized actions and skills have also been studied in reinforcement learning as mechanisms for structured abstraction and task transfer\(Masson et al\.,[2016](https://arxiv.org/html/2606.00336#bib.bib39); Hausknecht & Stone,[2016](https://arxiv.org/html/2606.00336#bib.bib19); Dalal et al\.,[2021](https://arxiv.org/html/2606.00336#bib.bib6); Zhang et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib64); Gupta et al\.,[2025](https://arxiv.org/html/2606.00336#bib.bib17)\)\. By learning explicit hierarchical representations, these methods improve sample efficiency and enable the composition of complex behaviors in long\-horizon tasks\(Fu et al\.,[2023](https://arxiv.org/html/2606.00336#bib.bib15); Zheng et al\.,[2024](https://arxiv.org/html/2606.00336#bib.bib65); Konidaris & Barto,[2009](https://arxiv.org/html/2606.00336#bib.bib28); Queisser & Steil,[2018](https://arxiv.org/html/2606.00336#bib.bib48); Sutton et al\.,[1999](https://arxiv.org/html/2606.00336#bib.bib54); Frans et al\.,[2018](https://arxiv.org/html/2606.00336#bib.bib14); Vezhnevets et al\.,[2017](https://arxiv.org/html/2606.00336#bib.bib55)\)\. PDP extends these classical concepts to the modern generative paradigm by learning a continuous behavior\-semantic manifold that parameterizes the diffusion denoising process\.

## 3 Background

##### Diffusion models\.

Let $x_0 \in \mathbb{R}^{D}$ denote a data vector (e.g., an action chunk with dimension $D$). Denoising diffusion probabilistic models (DDPMs) define a forward noising process (Ho et al., [2020](https://arxiv.org/html/2606.00336#bib.bib20)):

$$q(x_k \mid x_{k-1}) = \mathcal{N}\!\left(\sqrt{\alpha_k}\,x_{k-1},\,(1-\alpha_k)I\right), \tag{1}$$

where $k = 1, \dots, K$ indexes the diffusion timestep, and $x_k$ denotes the state of the sample after applying $k$ steps of the forward diffusion process to $x_0$. Let $\bar{\alpha}_k \triangleq \prod_{s=1}^{k} \alpha_s$. Then, the marginal has the following closed form:

$$x_k = \sqrt{\bar{\alpha}_k}\,x_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{2}$$

A conditional diffusion model learns a noise predictor $\epsilon_\theta(x_k, k, c)$, where $c$ denotes an arbitrary conditioning variable, using the standard noise-prediction objective:

$$\mathcal{L}_{\mathrm{DDPM}}(\theta) = \mathbb{E}_{x_0,k,\epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_k, k, c)\right\|_2^2\right]. \tag{3}$$
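
To make the closed-form marginal in Eq. (2) and the per-sample loss in Eq. (3) concrete, the following is a minimal sketch in plain Python. The linear noise schedule, the value $K = 100$, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random

def alpha_bars(alphas):
    """Cumulative products: alpha_bar_k = prod_{s<=k} alpha_s."""
    out, running = [], 1.0
    for a in alphas:
        running *= a
        out.append(running)
    return out

def forward_noise(x0, k, abar, rng):
    """Sample x_k directly from x_0 via Eq. (2); k is 1-indexed."""
    ab = abar[k - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xk = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xk, eps

def noise_prediction_loss(eps, eps_hat):
    """Squared L2 error between true and predicted noise (one sample of Eq. (3))."""
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))

# Hypothetical linear schedule with K = 100 steps (an assumption for illustration).
K = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (K - 1) for i in range(K)]
alphas = [1.0 - b for b in betas]
abar = alpha_bars(alphas)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
xk, eps = forward_noise(x0, k=50, abar=abar, rng=rng)
```

Because $x_k$ is an affine function of $x_0$ and $\epsilon$, a training step can draw a random $k$ and noise a sample in one shot rather than simulating $k$ sequential steps of Eq. (1).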

##### Diffusion policies for action chunks\.

A robot demonstration trajectory is $\tau = \{(o_t, a_t)\}_{t=0}^{T-1}$, where $o_t$ is an observation and $a_t \in \mathbb{R}^{d_a}$ is a continuous action. Diffusion policies model *action chunks* of horizon $H$ (Chi et al., [2023](https://arxiv.org/html/2606.00336#bib.bib4)), denoted by $A_t = \{a_t, a_{t+1}, \dots, a_{t+H-1}\}$, conditioned on a context feature $c_t$, such as an encoding of recent observations. A diffusion policy $\pi_\theta(A_t \mid c_t)$ instantiates ([2](https://arxiv.org/html/2606.00336#S3.E2))–([3](https://arxiv.org/html/2606.00336#S3.E3)) with $x_0 \equiv A_t$, producing chunks by iterative denoising and executing them in a receding-horizon fashion.
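
The receding-horizon execution pattern can be sketched as follows. This is an illustrative outline only: `sample_chunk` is a hypothetical stand-in for denoising $A_t$ given $c_t$, and the number of actions executed before replanning (`n_exec`) is an assumed parameter that the text above does not specify.

```python
def receding_horizon_rollout(sample_chunk, T, n_exec):
    """Execute the first n_exec actions of each predicted chunk, then replan.

    sample_chunk(t) is a placeholder for the policy: it returns an H-step
    action chunk A_t = [a_t, ..., a_{t+H-1}] for the current timestep t.
    """
    if n_exec < 1:
        raise ValueError("n_exec must be at least 1")
    executed = []
    t = 0
    while t < T:
        chunk = sample_chunk(t)
        step = chunk[:min(n_exec, T - t)]  # never run past the episode end
        if not step:
            raise ValueError("policy returned an empty chunk")
        executed.extend(step)
        t += len(step)
    return executed

# Toy "policy" whose chunk at time t is simply [t, t+1, t+2, t+3] (H = 4).
toy_policy = lambda t: list(range(t, t + 4))
actions = receding_horizon_rollout(toy_policy, T=7, n_exec=2)
```

With `n_exec` smaller than the chunk horizon, later actions in each chunk are discarded and re-predicted from fresher context, which is the usual trade-off between temporal consistency and reactivity.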

##### Test\-time inversion via latent/noise optimization\.

Given a trained generative model $G_\theta(\cdot)$, including diffusion models, a standard adaptation mechanism is to optimize an input variable $u$ (e.g., a noise seed, a conditioning vector, or an internal latent) while keeping $\theta$ fixed. In particular, given a target behavior $\tilde{\tau}$, define the discrepancy loss as $\mathcal{L}_{\mathrm{fit}}(u;\tilde{\tau})$ and solve

$$u^{\star} = \arg\min_{u}\; \mathcal{L}_{\mathrm{fit}}(u;\tilde{\tau}) + \lambda\,\mathcal{R}(u),$$

where $\mathcal{R}$ enforces a prior or feasibility bias on $u$ (e.g., $\|u\|_2^2$).

![Refer to caption](https://arxiv.org/html/2606.00336v1/x2.png)

Figure 2: PDP framework. Training (left): The trajectory encoder $E_\phi$ embeds each demonstration $\tau$ into a behavior latent code $z$ sampled from a Gaussian posterior. The latent representation is optimized via a joint objective: a standard VAE loss (reconstruction $\mathcal{L}_{\text{rec}}$ and KL-divergence $D_{\text{KL}}$) preserves information and regularizes the latent distribution, while a geometry loss $\mathcal{L}_{\text{geo}}$ aligns latent distances $\delta_{ij}^{z}$ with physical trajectory similarities $\delta_{ij}^{X}$ computed via soft-DTW. In parallel, the denoiser $\epsilon_\theta$ is trained to predict the noise $\epsilon$ added to action chunks $A^{k}$, conditioning on environmental observations and $z$ through global modulation. Inference-time fitting (right): Given a new demonstration $\tilde{\tau}$, the frozen encoder $E_\phi$ provides a semantic warm-start $z_0$. This latent code is iteratively refined by minimizing the fitting loss $\mathcal{L}_{\text{fit}}$ via gradient descent ($\nabla_z \mathcal{L}_{\text{fit}}$) through the frozen denoiser. After fitting, the optimized $z$ is used to condition the denoiser $\epsilon_\theta$, which generates the full action sequence by iteratively denoising a Gaussian seed into a trajectory adapted to the new constraints.

## 4 Method

We introduce Parameterized Diffusion Policy (PDP), which extends the diffusion policy framework by incorporating a continuous behavior latent variable $z \in \mathbb{R}^{d_z}$ to explicitly model the multimodality of the training data. PDP factorizes control into two components: (i) a geometry-aligned behavior manifold that maps high-dimensional demonstrations to a latent space where Euclidean distances reflect trajectory similarity, and (ii) a parameterized diffusion denoiser that generates action sequences conditioned on both environmental context and a target latent variable $z$. Unlike standard diffusion policies, which rely on stochastic noise for diversity, PDP exposes $z$ as a low-dimensional control handle. At test time, this interface enables precise mode selection, smooth behavior interpolation, and rapid adaptation to novel constraints by optimizing only the latent $z$, while keeping the underlying policy parameters fixed (Fig. [2](https://arxiv.org/html/2606.00336#S3.F2)).

### 4.1 Learning a Geometry-Aligned Behavior Latent

PDP centers on learning a latent manifold that explicitly models the multimodality of the training data, such that the distance between two latent codes $z_i$ and $z_j$ reflects the semantic similarity of their corresponding physical behaviors. Rather than structuring this space directly on raw observations, we extract a behavior trace from each demonstration $\tau$: a state-based feature sequence that captures the essential geometric and functional structure of the task execution (e.g., positional states). Mapping such traces to continuous latent codes $z$ transforms the high-dimensional variability of demonstrations into a structured coordinate chart suitable for precise behavior steering and interpolation.

##### Behavior Embedding Architecture\.

To map the complex variability of demonstrations into a continuous latent space, we learn a trajectory encoder $E_\phi$. This encoder maps each full trajectory $\tau$ to a Gaussian posterior over latent codes:

$$q_\phi(z \mid \tau) = \mathcal{N}\big(\mu_\phi(\tau),\,\mathrm{diag}(\sigma^2_\phi(\tau))\big). \tag{4}$$

For downstream policy conditioning, we use the deterministic mean representation $\bar{z}(\tau) \triangleq \mu_\phi(\tau)$. Although the encoder processes the complete demonstration, a symmetric decoder $D_\psi$ is employed during training to reconstruct the original trajectory $\tau$ from $z$, ensuring that the latent space preserves all necessary behavioral information.

##### Joint Embedding Objective\.

The encoder-decoder $(\phi, \psi)$ is optimized via a multi-objective loss that balances information density, distributional regularity, and metric alignment:

$$\mathcal{L}_{\text{embed}} = \mathcal{L}_{\text{rec}} + \beta_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \beta_{\text{geo}}\,\mathcal{L}_{\text{geo}}. \tag{5}$$

The first two terms correspond to the standard VAE objective, i.e., the negative evidence lower bound (ELBO) up to constants and the weight $\beta_{\text{KL}}$. The reconstruction loss, $\mathcal{L}_{\text{rec}} = \mathbb{E}_{z \sim q_\phi(\cdot \mid \tau)}\big[\|\tau - D_\psi(z)\|_2^2\big]$, trains $z$ to capture the salient features of the demonstration, while the prior regularization, $\mathcal{L}_{\text{KL}} = D_{\text{KL}}\big(q_\phi(z \mid \tau)\,\|\,\mathcal{N}(0, I)\big)$, shapes the latent distribution toward a standard normal prior, facilitating test-time sampling and preventing the manifold from collapsing.
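
For the diagonal Gaussian posterior of Eq. (4), the KL term has the well-known closed form $\tfrac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\big)$. The sketch below evaluates it and combines the three terms of Eq. (5). The log-variance parameterization, the reparameterized sampling step, and the function names are common VAE practice assumed here for illustration, not details confirmed by the text.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (assumed sampling scheme)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def embed_loss(rec, kl, geo, beta_kl, beta_geo):
    """Weighted sum of Eq. (5): L_rec + beta_KL * L_KL + beta_geo * L_geo."""
    return rec + beta_kl * kl + beta_geo * geo

mu, log_var = [0.5, -0.5], [0.0, 0.0]
kl = kl_to_standard_normal(mu, log_var)  # 0.5 * (0.25 + 0.25) = 0.25
z = reparameterize(mu, log_var, random.Random(0))
```

The KL term vanishes exactly when the posterior equals the prior, so $\beta_{\text{KL}}$ controls how strongly codes are pulled toward the origin relative to reconstruction and geometry alignment.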

##### Geometry Alignment via Soft\-DTW\.

Recall that standard VAEs do not guarantee that latent proximity corresponds to behavioral similarity. To make $z$ a *smooth* space for search and interpolation, we explicitly align the Euclidean geometry in the latent space with the physical geometry of the trajectories. In particular, for each demonstration $\tau$, we extract representative features $X(\tau)$, such as positional features in the proprioception state. We then use Dynamic Time Warping (DTW) on $X(\tau)$ to compute trajectory distances, since DTW provides a distance measure that is robust to temporal variations and execution speed.

Unlike standard $L_2$ norms, which are sensitive to time shifts, DTW identifies an optimal alignment between two sequences, $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_m)$, by matching points such that endpoint alignment and monotonicity are preserved. Given a local cost matrix $D$, where $D_{i,j} = \Delta(a_i, b_j)$, the cumulative cost $r_{i,j}$ is computed via the recurrence $r_{i,j} = D_{i,j} + \min\{r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}\}$, where the final distance is $r_{n,m}$. Since the $\min$ operator is non-differentiable, we employ the soft-DTW relaxation. This replaces the hard minimum with a smoothed operator: $\text{softmin}_\gamma(x_1, \dots, x_k) = -\gamma \log \sum_{i=1}^{k} \exp(-x_i/\gamma)$, allowing gradients to flow back to the encoder. A more comprehensive analysis is provided in Appendix [A.1](https://arxiv.org/html/2606.00336#A1.SS1).
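
The recurrence and the softmin operator above translate almost directly into a dynamic program. The following is a minimal sketch for scalar sequences; the squared-difference local cost, the boundary handling with infinite entries, and the function names are assumptions for illustration (vector-valued features would need a different `cost`).

```python
import math

def softmin(values, gamma):
    """-gamma * log(sum(exp(-x / gamma))), shifted by the minimum for stability.

    Infinite entries (out-of-range predecessors) contribute nothing to the sum.
    """
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    m = min(finite)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))

def soft_dtw(A, B, gamma, cost=lambda a, b: (a - b) ** 2):
    """Soft-DTW value r[n][m] using the recurrence with softmin in place of min."""
    n, m = len(A), len(B)
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
            r[i][j] = cost(A[i - 1], B[j - 1]) + softmin(prev, gamma)
    return r[n][m]

# Two traces of the same motion executed at different speeds.
slow_start = [0.0, 0.0, 1.0, 2.0]
slow_end = [0.0, 1.0, 2.0, 2.0]
d_soft = soft_dtw(slow_start, slow_end, gamma=1e-3)
d_pointwise = sum((a - b) ** 2 for a, b in zip(slow_start, slow_end))
```

Here the pointwise squared error between the two traces is 2, while the soft-DTW value is close to zero because warping absorbs the timing difference. Note that the softmin is a lower bound on the hard minimum, so soft-DTW values can be slightly negative and shrink as $\gamma$ grows.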

Next, for each pair of demonstrations $(\tau_i, \tau_j)$, we define the trajectory distance $\delta^{X}_{ij}$ and the latent distance $\delta^{z}_{ij}$ as follows:

$$\delta^{X}_{ij} \triangleq d^{\gamma}_{\text{softDTW}}\big(X(\tau_i), X(\tau_j)\big), \quad \delta^{z}_{ij} \triangleq \|\bar{z}(\tau_i) - \bar{z}(\tau_j)\|_2. \tag{6}$$

The geometry loss $\mathcal{L}_{\text{geo}}$ (Eq. ([5](https://arxiv.org/html/2606.00336#S4.E5))) enforces a linear correspondence between these distances:

$$\mathcal{L}_{\text{geo}} = \mathbb{E}_{(i,j)\sim\mathcal{P}}\big[\big(\delta^{z}_{ij} - \kappa\,\delta^{X}_{ij}\big)^2\big], \tag{7}$$

where $\kappa$ is a scaling constant and $\mathcal{P}$ is the distribution over sampled trajectory pairs. This alignment ensures the manifold is navigable, supporting smooth interpolation and predictable steering. Appendix [A.4](https://arxiv.org/html/2606.00336#A1.SS4) analyzes how the geometry-aligned latent structure facilitates training and stabilizes low-dimensional adaptation.
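
As a small worked instance of Eq. (7), the sketch below evaluates the geometry loss over a fixed set of pairs. The trajectory distances are assumed to be precomputed (for example with soft-DTW) and stored in a matrix; the toy numbers, the explicit pair list standing in for sampling from $\mathcal{P}$, and the function names are illustrative assumptions.

```python
import math

def latent_distance(zi, zj):
    """Euclidean distance between two latent codes (delta^z in Eq. (6))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zi, zj)))

def geometry_loss(latents, traj_dist, pairs, kappa):
    """Mean of (delta_z - kappa * delta_X)^2 over the given index pairs."""
    total = 0.0
    for i, j in pairs:
        dz = latent_distance(latents[i], latents[j])
        total += (dz - kappa * traj_dist[i][j]) ** 2
    return total / len(pairs)

# Three toy latent codes and a made-up matrix of trajectory distances.
latents = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
traj_dist = [[0.0, 10.0, 20.0],
             [10.0, 0.0, 10.0],
             [20.0, 10.0, 0.0]]
pairs = [(0, 1), (0, 2), (1, 2)]

aligned = geometry_loss(latents, traj_dist, pairs, kappa=0.5)     # 0.0
misaligned = geometry_loss(latents, traj_dist, pairs, kappa=1.0)  # 50.0
```

With $\kappa = 0.5$ every latent distance (5, 10, 5) equals the scaled trajectory distance, so the loss is zero; with $\kappa = 1$ the residuals are 5, 10, and 5, giving a mean squared error of 50.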

### 4.2 Learning Parameterized Diffusion Policies

With a structured behavior manifold established, we now train a diffusion policy $\pi_\theta$ that generates action chunks $A_t$ conditioned on both the environmental context $c_t$ and the behavior latent code $z$. Unlike standard diffusion policies, which rely on stochastic noise as an implicit source of multimodality, PDP treats $z$ as an explicit control signal that determines the mode and style of the generated trajectory.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x3.png)

Figure 3: Global Modulation for Denoiser. The behavior latent code $z$ is transformed by MLPs into layer-specific parameters $\gamma^{(l)}$ and $\beta^{(l)}$. These are applied to feature maps $h^{(l)}$ via the affine transformation $h^{(l)} \leftarrow \gamma^{(l)}(z) \odot h^{(l)} + \beta^{(l)}(z)$.

##### Latent-Conditioned Denoising via Global Modulation.

As discussed in Section [3](https://arxiv.org/html/2606.00336#S3), we model the policy as an iterative denoising process. Each action chunk $A_t$ within a demonstration $\tau$ is associated with the behavior latent code $\bar{z}(\tau)$ produced by the frozen encoder $E_\phi$. We train a neural denoiser $\epsilon_\theta$ to predict the noise $\epsilon$ added to the action chunk at diffusion step $k$:

$$\hat{\epsilon} = \epsilon_\theta(A_t^{k}, k, c_t, \bar{z}), \qquad \bar{z} \triangleq \mathrm{sg}(\mu_\phi(\tau)). \tag{8}$$

This formulation incorporates two critical design choices for effective behavior steering.

(1) Decoupled Representation Learning: the use of the stop-gradient operator $\mathrm{sg}(\cdot)$ is essential to decouple the geometric shaping of the behavior manifold from denoiser training. By treating $\bar{z}$ as a fixed input, we prevent the diffusion loss (which is primarily concerned with mode fitting) from collapsing or distorting the distance-preserving properties enforced by $\mathcal{L}_{\text{geo}}$.

(2) Deep Behavior Integration: rather than treating $\bar{z}$ as a standard input feature, we implement the denoiser as a modulated neural field. The behavior latent code deeply reconfigures the network’s internal score field via a modulation architecture inspired by FiLM (Perez et al., [2018](https://arxiv.org/html/2606.00336#bib.bib46)). As illustrated in Figure [3](https://arxiv.org/html/2606.00336#S4.F3), this mechanism applies affine transformations to intermediate feature tensors $h^{(\ell)}$ at multiple layers: $h^{(\ell)} \leftarrow \gamma^{(\ell)}(\bar{z}) \odot h^{(\ell)} + \beta^{(\ell)}(\bar{z})$, where $\gamma^{(\ell)}$ and $\beta^{(\ell)}$ are lightweight MLPs. This hierarchical modulation provides a robust handle for shifting the output distribution across distinct behavioral modes, so that low-dimensional latent codes effectively steer high-dimensional action generation. Our ablation studies confirm that this deep integration is crucial, as a latent code supplied through naive concatenation is often ignored by the denoiser.
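
The affine modulation $h^{(\ell)} \leftarrow \gamma^{(\ell)}(\bar{z}) \odot h^{(\ell)} + \beta^{(\ell)}(\bar{z})$ can be illustrated for a single layer as below. This is a simplified sketch: single linear maps stand in for the lightweight MLPs, and all weights, shapes, and names are made-up assumptions rather than the paper's architecture.

```python
def linear(W, b, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def film(h, z, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise modulation: gamma(z) * h + beta(z), elementwise.

    gamma and beta are single linear layers here (a stand-in for small MLPs).
    """
    gamma = linear(W_gamma, b_gamma, z)
    beta = linear(W_beta, b_beta, z)
    return [g * hi + bt for g, hi, bt in zip(gamma, h, beta)]

h = [1.0, 2.0, 3.0]   # features of one layer (3 channels)
z = [0.5, -0.5]       # behavior latent code (d_z = 2)

# Zero weights with unit scale bias: the modulation is the identity.
zeros = [[0.0, 0.0] for _ in range(3)]
identity = film(h, z, zeros, [1.0, 1.0, 1.0], zeros, [0.0, 0.0, 0.0])

# Non-trivial weights: z now rescales and shifts each channel.
W_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_beta = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
modulated = film(h, z, W_gamma, [1.0, 1.0, 1.0], W_beta, [0.0, 0.0, 0.0])
# gamma = [1.5, 0.5, 1.0], beta = [0.0, 0.0, 1.0] -> [1.5, 1.0, 4.0]
```

Because $z$ multiplies every channel at every modulated layer, the network cannot ignore it as easily as an extra input concatenated at the first layer.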

##### Training and Inference\.

The policy parameters $\theta$ are optimized by minimizing the latent-conditioned noise-prediction error:

$$\mathcal{L}_{\text{DP}}(\theta) = \mathbb{E}_{\tau,t,k,\epsilon}\big[\|\epsilon - \epsilon_\theta(A_t^{k}, k, c_t, \bar{z}(\tau))\|_2^2\big]. \tag{9}$$

We use a joint training scheme in which we alternate between (i) updating the embedding parameters $(\phi, \psi)$ using $\mathcal{L}_{\text{embed}}$ and (ii) updating the policy parameters $\theta$ using $\mathcal{L}_{\text{DP}}$. A detailed procedural breakdown of this scheme is provided in Appendix [A.2](https://arxiv.org/html/2606.00336#A1.SS2).

### 4.3 Test-Time Control and Latent Adaptation

Once trained, PDP provides a continuous, optimizable interface for steering the policy without updating policy weights. While standard diffusion policies rely on stochastic sampling to discover behaviors, PDP treats the behavior manifold as a structured “search space”. The high-level intuition is to replace unstructured noise sampling with gradient-based optimization in latent parameter space to identify the specific latent code $z$ that best satisfies the new environmental constraints encountered at deployment time. This change turns test-time adaptation into a more stable, well-conditioned refinement process.

##### Latent Fitting with Encoder Warm\-Start\.

When environmental constraints shift or a specific target behavior $\tilde{\tau}$ is required, we infer the optimal latent code $z^{*}$ that aligns the (frozen) diffusion policy with the given requirements. Unlike prior inversion methods that optimize in high-dimensional noise spaces or unregularized latent spaces, we use our geometry-aligned encoder to provide a *semantic warm-start*: $z_0 = \mu_\phi(X(\tilde{\tau}))$. This initialization places $z$ near the relevant region of the behavior manifold, improving optimization efficiency and stability. Starting from $z_0$, we refine the latent code by minimizing a diffusion-consistent fitting objective:

$$\mathcal{L}_{\text{fit}}(z;\tilde{\tau})=\mathbb{E}_{t,k,\epsilon}\Big[\big\|\epsilon-\epsilon_{\theta}(\tilde{A}_{t}^{k},k,\tilde{c}_{t},z)\big\|_{2}^{2}\Big]+\lambda\|z\|_{2}^{2}, \tag{10}$$

where $\tilde{A}_{t}^{k}$ denotes the action chunk from the target demonstration noised according to the forward process. The first term in Eq. ([10](https://arxiv.org/html/2606.00336#S4.E10)) identifies the latent code under which the frozen denoiser best matches the target actions, while the quadratic regularizer ensures $z$ remains within the support of the training distribution. We then solve for the optimal latent code via gradient descent:

$$z\leftarrow z-\eta\nabla_{z}\mathcal{L}_{\text{fit}}(z;\tilde{\tau}), \tag{11}$$

typically running $S$ steps before execution to obtain a precise control handle for the resulting rollout. The complete test-time latent adaptation procedure is formalized in Appendix [A.3](https://arxiv.org/html/2606.00336#A1.SS3).
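
The fitting loop in Eqs. (10) and (11) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the frozen $\epsilon_{\theta}$ that simply treats $z$ as the clean action chunk, the observation context $\tilde{c}_{t}$ is omitted, the gradient is taken by central differences rather than automatic differentiation, and the values of $\lambda$, $\eta$, $S$, and the noise levels are arbitrary.

```python
import math
import random


def toy_denoiser(noisy_action, alpha_bar, z):
    """Hypothetical frozen eps_theta: assumes the clean action chunk equals z."""
    sa, sn = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - sa * zi) / sn for x, zi in zip(noisy_action, z)]


def fit_loss(z, target_action, samples, lam):
    """Monte Carlo estimate of Eq. (10) over a fixed set of (alpha_bar, eps) draws."""
    total = 0.0
    for alpha_bar, eps in samples:
        sa, sn = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        # Forward process: noise the target action chunk.
        noisy = [sa * a + sn * e for a, e in zip(target_action, eps)]
        pred = toy_denoiser(noisy, alpha_bar, z)
        total += sum((e - p) ** 2 for e, p in zip(eps, pred))
    return total / len(samples) + lam * sum(zi ** 2 for zi in z)


def numerical_grad(f, z, h=1e-5):
    """Central-difference gradient; a real system would use autodiff instead."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grad.append((f(z_plus) - f(z_minus)) / (2.0 * h))
    return grad


def fit_latent(z0, target_action, samples, lam=1e-3, eta=0.1, steps=50):
    """Run S = steps gradient-descent updates of Eq. (11), starting from z0."""
    z = list(z0)
    for _ in range(steps):
        g = numerical_grad(lambda v: fit_loss(v, target_action, samples, lam), z)
        z = [zi - eta * gi for zi, gi in zip(z, g)]
    return z


rng = random.Random(0)
# Fixed noise draws so the loss is a deterministic function of z.
samples = [(rng.choice([0.2, 0.5, 0.8]), [rng.gauss(0, 1), rng.gauss(0, 1)])
           for _ in range(16)]
target_action = [0.8, -0.3]   # toy stand-in for the target action chunk
z0 = [0.7, -0.2]              # toy stand-in for the encoder warm-start
z_star = fit_latent(z0, target_action, samples)
```

Only `z` changes inside the loop; the denoiser is called but never updated, which mirrors the frozen-policy setting described above.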

##### Long\-Horizon Adaptation via Segmented Latents\.

For complex tasks consisting of distinct functional phases, a single global latent code may be insufficient to capture all required behavioral variation. PDP naturally extends to these settings by supporting piecewise-constant latent sequences. In particular, given a segmented demonstration $\tilde{\tau}=\{\tilde{\tau}^{(m)}\}_{m=1}^{M}$, we fit an independent latent code $z^{(m)}$ for each phase $m$ by minimizing the joint objective:

$$\min_{z^{(1)},\dots,z^{(M)}}\;\sum_{m=1}^{M}\mathcal{L}_{\text{fit}}(z^{(m)};\tilde{\tau}^{(m)}). \tag{12}$$

This allows the policy to transition through a sequence of behavior-specific regions of the latent manifold, enabling the synthesis of complex, long-horizon maneuvers while maintaining the interpretability and steerability of the individual latent segments.
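
Because each term in Eq. (12) depends on a different $z^{(m)}$, the segments can be fitted one at a time. At rollout time the policy then needs to know which latent is active. The sketch below shows one possible lookup for a piecewise-constant latent sequence. The names `latent_at`, `segment_starts`, and `segment_latents`, and the choice to switch latents by timestep index, are assumptions for illustration; this section does not specify how segment boundaries are determined during execution.

```python
from bisect import bisect_right


def latent_at(t, segment_starts, segment_latents):
    """Return the latent code of the segment that contains timestep t.

    segment_starts must be sorted; segment m covers timesteps from
    segment_starts[m] up to (but not including) segment_starts[m + 1].
    """
    if len(segment_starts) != len(segment_latents):
        raise ValueError("need exactly one latent code per segment")
    idx = bisect_right(segment_starts, t) - 1
    return segment_latents[max(idx, 0)]


# Hypothetical three-phase task: the boundaries and latent values are made up.
segment_starts = [0, 40, 90]
segment_latents = [[0.2, -1.0], [1.1, 0.3], [-0.5, 0.8]]
schedule = [latent_at(t, segment_starts, segment_latents) for t in range(120)]
```

The latent stays fixed within a phase and changes only at a boundary, which is what "piecewise-constant" means here.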

## 5Experiments

![Refer to caption](https://arxiv.org/html/2606.00336v1/x4.png)
Figure 4: Benchmark domains for evaluating controllable multimodal imitation. Top row: Training environments with diverse expert demonstrations. In OpenDrawer and CloseDrawer, experts use varying approach paths to reach the handle; in MeatOffGrill and BlockPlacement, the training dataset consists of distinct combinations of reaching and carrying trajectories; in PickUpCup, the robot must select between four discrete handles placed along the cup’s rim to initiate a grasp. Bottom row: Testing environments featuring constraint-induced behavior shifts, showing one of the three evaluation variants tested for each task. Red barriers are introduced to obstruct trajectories associated with certain training modes (making them infeasible), or handles are removed to leave only a single feasible strategy.

Our experiments are designed to systematically evaluate the capacity of PDP to transform diffusion-based behavior cloning into a controllable and adaptable control framework. We evaluate four key properties: (i) effective behavior steering, demonstrating PDP’s ability to reliably adapt by selecting or synthesizing feasible strategies under environmental constraint shifts; (ii) original-task fidelity, demonstrating that behavior latent codes are sufficiently expressive to reconstruct multimodal distributions present in the training data; (iii) geometry-driven generalization, analyzing how the structured latent space enables smooth behavior interpolation; and (iv) architectural efficacy, verifying through ablation studies that each algorithmic component is necessary. We evaluate PDP across a suite of nine manipulation domains that exhibit high multimodality under identical initial conditions. The domains span both simulation and real-robot hardware and are designed to test strategy discovery, selection, and adaptation. The code and experiment configurations are available at https://github.com/Valarzz/pdp.

##### Multimodal Dataset Construction\.

We create a true multimodal dataset for each task, capturing both continuous variability within each mode and qualitatively distinct behaviors across modes. The benchmarks include single-stage domains (CloseDrawer, OpenDrawer, PickUpCup, OpenDoor, OpenMicrowave, Avoiding24, Avoiding32), which require selecting a single execution mode, and long-horizon domains (PlaceBlock, MeatOffGrill), where execution modes arise from the combinatorial composition of mode choices across independent reaching and carrying stages. More details are provided in Appendix [B.1](https://arxiv.org/html/2606.00336#A2.SS1).

##### Baselines\.

For all domains, we compare PDP against eight state-of-the-art competitors spanning the main methodological families for this setting: standard Diffusion Policy (DP) (Chi et al., [2023](https://arxiv.org/html/2606.00336#bib.bib4)), Behavior Cloning (BC) (Mandlekar et al., [2021](https://arxiv.org/html/2606.00336#bib.bib38)), BC with Gaussian Mixtures (BC-GMM) (Calinon et al., [2007](https://arxiv.org/html/2606.00336#bib.bib2)), Implicit Behavioral Cloning (IBC) (Florence et al., [2022](https://arxiv.org/html/2606.00336#bib.bib13)), VQ-BeT (Lee et al., [2024](https://arxiv.org/html/2606.00336#bib.bib29)), BESO (Reuss et al., [2023](https://arxiv.org/html/2606.00336#bib.bib49)), Diffusion-ES (Diff-ES) (Yang et al., [2024](https://arxiv.org/html/2606.00336#bib.bib62)), and ADPro (Li et al., [2025c](https://arxiv.org/html/2606.00336#bib.bib34)). Together, these methods cover the major paradigms for modeling and adapting multimodal robot behavior: generative, unimodal, explicit mixture, energy-based, vector-quantized, goal-conditioned, evolutionary-search, and test-time-adaptive approaches.

Table 1: Success rate (%) on original simulation scenes before constraint shifts (mean $\pm$ std over 3 random seeds). The best performance in each row is bolded.

Table 2: Success rate (%) on constraint-shifted simulation scenes for eight manipulation domains, comparing nine methods (mean $\pm$ std over 3 seeds). Scenes 1–2 introduce constraint shifts that make all but one training mode infeasible. Scene 3 is a zero-mode-feasible setting: the new constraints make all training modes infeasible, and a single new demonstration is provided for adaptation. The best performance in each row is shown in bold.

### 5\.1Simulation and Real\-Robot Results

##### Performance in Original Multimodal Domains\.

We first evaluate each method in the original multimodal domains, before introducing any constraint shifts, to measure how well PDP preserves the behavior diversity present in the training data. As shown in Table [1](https://arxiv.org/html/2606.00336#S5.T1), PDP achieves consistently high success rates across all tasks, including those with a large number of distinct behavior modes in the training data. This indicates that conditioning the diffusion process on a learned behavior latent code does not compromise fidelity in the original setting.

Standard DP, while generally outperforming unimodal BC, exhibits markedly lower success on tasks where distinct strategies induce highly divergent action geometries. Although DP can represent multimodal action distributions, unconditioned denoising struggles to reliably determine which strategy to execute among a set of incompatible alternatives, often leading to noisy and inconsistent rollouts. By parameterizing behavior explicitly, PDP separates strategy selection from within-strategy denoising, simplifying learning and yielding substantially more stable execution. This effect is further illustrated by the trajectory visualizations in Appendix [C.2](https://arxiv.org/html/2606.00336#A3.SS2): while non-diffusion baselines struggle to capture the underlying multimodal structure and DP fails to consistently steer specific modes, PDP robustly reconstructs intended trajectories with minimal intra-mode noise.

##### Adaptation to Constraint\-Induced Shifts\.

We next evaluate behavior generalization under constraint-induced shifts, where environmental changes make most behavior modes covered in the training data infeasible while leaving the task goal unchanged. Each domain is tested under two evaluation regimes: (i) Existing Mode Fitting, with two *constraint-shifted scenes* (Scenes 1–2) in which obstacles or affordance removals make all but one training mode infeasible; and (ii) Novel Behavior Discovery, a *zero-mode-feasible* setting (Scene 3) where no training mode remains valid and success requires discovering a qualitatively new behavior. Fig. [4](https://arxiv.org/html/2606.00336#S5.F4) depicts each domain’s original scene together with one representative constraint shift; the remaining shift variants and task specifications are detailed in Appendix [B.3](https://arxiv.org/html/2606.00336#A2.SS3). Due to space constraints, Table [2](https://arxiv.org/html/2606.00336#S5.T2) reports results on four representative domains, and we defer the complete results across all eight domains to Appendix [B.2](https://arxiv.org/html/2606.00336#A2.SS2).

In all constraint-shifted scenes, we provide a single successful demonstration $\tilde{\tau}$ and evaluate each method’s ability to adapt from it. PDP aims to enable such adaptation without retraining. PDP adapts by optimizing only the behavior latent code $z$ via test-time fitting; DP performs optimization in its high-dimensional noise space; BC fine-tunes policy weights with a small learning rate; BC-GMM selects mixture components via posterior likelihood; IBC biases inference toward $\tilde{\tau}$; Diff-ES uses the DTW distance to $\tilde{\tau}$ as its evolutionary-search objective; ADPro guides denoising with the gradient of a soft-DTW objective relative to $\tilde{\tau}$; and BESO and VQ-BeT condition on waypoints extracted from $\tilde{\tau}$ as goals. Additional details are provided in Appendix [B.4](https://arxiv.org/html/2606.00336#A2.SS4).
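
Two of the baseline adaptation objectives above rely on dynamic time warping (DTW). For readers unfamiliar with it, the sketch below shows the textbook DTW recurrence in plain Python. It is not taken from the Diff-ES or ADPro implementations; the function name `dtw_distance` and the Euclidean point-wise cost are assumptions chosen for illustration, and the differentiable soft-DTW variant is not shown.

```python
import math


def dtw_distance(traj_a, traj_b):
    """Classic dynamic-time-warping cost between two sequences of points."""
    n, m = len(traj_a), len(traj_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i points of traj_a
    # with the first j points of traj_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a point of traj_b
                                 cost[i][j - 1],      # repeat a point of traj_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Hypothetical 2-D end-effector paths.
path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
slow_path = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # same shape, slower start
offset_path = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]            # shifted by one unit
```

DTW ignores differences in timing (`path` and `slow_path` align at zero cost) but penalizes differences in shape (`path` and `offset_path` do not), which is why it is a natural measure of closeness to a reference demonstration.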

As shown in Table [2](https://arxiv.org/html/2606.00336#S5.T2), PDP maintains high success across all constraint-shifted evaluations, while all baseline methods exhibit severe performance degradation.

In the Existing Mode Fitting setting, PDP succeeds nearly perfectly across all tasks, demonstrating that the learned latent space provides a stable control handle for steering the policy. In contrast, standard DP and other frozen-model baselines fail to consistently converge, despite the correct strategy being present in the training data. Although BC achieves moderate success by explicitly fine-tuning its network parameters using the new demonstration, this form of adaptation relies on parameter updating rather than the model’s intrinsic capacity for controllable behavior selection.

In the Novel Behavior Discovery setting, PDP remains the only method that consistently succeeds across domains, indicating that its geometry-aligned latent space defines a behavior manifold that can be navigated to synthesize feasible new strategies. All baselines fail in this regime, including BC, even though it is allowed to update its weights using the new demonstration. This suggests that local parameter adaptation alone is insufficient to synthesize behaviors absent from the training data without an explicitly structured behavior space.

##### Real-Robot Robustness to Stochasticity.¹

¹ Videos of all real-robot experiments are available at [https://sites.google.com/view/parameterized-dp](https://sites.google.com/view/parameterized-dp).

In addition to simulation experiments, we evaluate PDP on a real-world manipulation task using a Franka Emika Panda robot arm. The task is OpenDrawer, which requires the robot to reach a drawer handle, establish contact, and pull the drawer open to a target configuration. Real-world demonstrations in this task are characterized by non-smooth trajectories and significant intra-mode variability (see Appendix [B.1](https://arxiv.org/html/2606.00336#A2.SS1)). Despite this, Table [3](https://arxiv.org/html/2606.00336#S5.T3) shows that PDP achieves a 100% success rate (5/5) across all scenes, including in the challenging zero-mode adaptation setting. Although standard DP matches PDP’s performance in the original scene, its success drops to as low as 1/5 (20% success rate) under constraint shifts. These results demonstrate that PDP’s learned behavior latent space is robust to real-world execution noise and provides a more stable handle for test-time policy steering.

Table 3: Real-robot OpenDrawer results (successes / 5 trials).

![Refer to caption](https://arxiv.org/html/2606.00336v1/x5.png)
Figure 5: Generalization via latent-space navigation. The learned behavior latent space organizes demonstrations into compact clusters, with Euclidean distances reflecting trajectory similarity. Navigating the manifold by smoothly interpolating between latent clusters allows the denoiser $\epsilon_{\theta}$ to generate novel behaviors between demonstrated modes.

### 5\.2Generalization via Behavior\-Space Navigation

We provide a qualitative analysis of the learned behavior manifold to support our claims regarding geometric regularity and the resulting steerability of the policy. Fig. [5](https://arxiv.org/html/2606.00336#S5.F5) depicts the behavior embedding and synthesis process on CloseDrawer. As expected from the metric alignment enforced by $\mathcal{L}_{\text{geo}}$, the latent space is organized into compact, well-separated clusters corresponding to the demonstration modes. To test the manifold’s capacity for novel behavior synthesis, we generate trajectories by conditioning the policy on an interpolated latent code $z(\lambda)$, where $\lambda$ indexes positions along a chosen path in the latent space, without any additional training. As $z(\lambda)$ traverses the latent manifold along elliptical sweeps or linear paths between clusters, the resulting end-effector (EE) trajectories exhibit smooth semantic transitions. This behavior empirically supports our dynamical-systems analysis in Appendix [A.4](https://arxiv.org/html/2606.00336#A1.SS4): by navigating between modes in a low-dimensional coordinate chart, we shift the equilibrium of the denoiser’s probability flow to synthesize intermediate, physically plausible strategies that are not exact replicas of the training data. Beyond the example shown here, Appendix [C.3](https://arxiv.org/html/2606.00336#A3.SS3) provides additional results in different domains, investigating other types of paths that can be followed while navigating the latent space.
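
The two path types mentioned above can be written down explicitly. The sketch below gives one possible parameterization of $z(\lambda)$ for a linear path between two cluster centers and for an elliptical sweep around a center. The function names, the two-dimensional latent, and the specific ellipse parameterization are assumptions made for illustration; the exact paths used in the paper are described in its appendix and may differ.

```python
import math


def linear_path(z_a, z_b, lam):
    """Point on the straight line from z_a (lam = 0) to z_b (lam = 1)."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(z_a, z_b)]


def elliptical_sweep(center, axis_u, axis_v, lam):
    """Point on an ellipse with semi-axes axis_u and axis_v; lam in [0, 1] is one loop."""
    angle = 2.0 * math.pi * lam
    return [c + math.cos(angle) * u + math.sin(angle) * v
            for c, u, v in zip(center, axis_u, axis_v)]


# Hypothetical cluster centers in a 2-D latent space.
z_mode_a = [0.0, 1.0]
z_mode_b = [2.0, 3.0]
line = [linear_path(z_mode_a, z_mode_b, i / 10) for i in range(11)]
loop = [elliptical_sweep([1.0, 2.0], [1.0, 0.0], [0.0, 0.5], i / 20) for i in range(21)]
```

Each latent in `line` or `loop` would then be passed to the frozen policy as its conditioning code, one rollout per value of $\lambda$.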

### 5\.3Ablation Studies

We conduct a series of ablation studies on the CloseDrawer and PickUpCup domains to evaluate the contribution of our core architectural choices. Unless otherwise specified, all variants are evaluated on the *zero-mode-feasible* variant (Scene 3) to test their impact on novel behavior adaptation.

##### Impact of Geometry Alignment & Latent Continuity\.

We first evaluate the need for metric alignment and for a continuous behavior manifold by comparing three variants defined by their latent construction: (i) Isometric, where PDP is conditioned on $\bar{z}$ from an encoder trained with $\mathcal{L}_{\text{embed}}$; (ii) Unaligned, where PDP is conditioned on $\bar{z}$ from an encoder trained with $\beta_{\text{geo}}=0$; and (iii) Discrete, where PDP is conditioned on a categorical one-hot embedding.

Table [4](https://arxiv.org/html/2606.00336#S5.T4) shows that removing geometric alignment causes a severe drop in adaptation performance, while the Discrete variant fails entirely. The Isometric latent space, by contrast, organizes behaviors according to physical trajectory similarity, so interpolation yields smooth, intermediate strategies. The Unaligned space, on the other hand, is geometrically warped. Effective test-time adaptation requires both a geometry-aligned latent and a continuous behavior manifold; without either, gradient-based latent fitting cannot reliably synthesize novel behaviors.

Table 4: Effect of latent integration strategy in the denoiser.

##### Importance of Global Modulation for Latent Integration\.

We now investigate how the method used to integrate the behavior signal affects the denoiser’s ability to resolve mode ambiguity. We compare: (i) Global Modulation (PDP using affine feature-map modulation), (ii) Concatenation (PDP with $z$ appended to the input), and (iii) Unconditioned (standard DP without a latent $z$).

The results in Table [5](https://arxiv.org/html/2606.00336#S5.T5) show that our global modulation architecture yields substantially higher adaptation success rates than naive concatenation, while also reducing diffusion training loss by nearly an order of magnitude (Appendix [C.1](https://arxiv.org/html/2606.00336#A3.SS1)). This stark contrast supports our claim that deep, explicit latent integration simplifies diffusion from a global mixture-modeling problem into a mode-specific denoising process, facilitating training and enabling the denoiser to more reliably track the target behavior.

Table 5: Effect of behavior latent integration on adaptation performance.
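
To make the contrast between the first two variants concrete, the sketch below compares affine feature modulation with input concatenation on a single feature vector. The modulation follows the general FiLM form (Perez et al., 2018), where a scale and a shift are computed from the conditioning vector; whether PDP uses exactly this form, and where in the denoiser it is applied, is an assumption here. All weights and names are hypothetical.

```python
def affine_map(weights, biases, z):
    """Compute W z + b with plain lists (one row of weights per output channel)."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weights, biases)]


def global_modulation(features, z, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift every feature channel with affine functions of z."""
    gamma = affine_map(w_gamma, b_gamma, z)
    beta = affine_map(w_beta, b_beta, z)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


def concatenation(features, z):
    """Append z to the input; later layers must learn how to use it."""
    return list(features) + list(z)


# Hypothetical 3-channel feature vector and 2-D latent code.
features = [1.0, -2.0, 0.5]
z = [1.0, 0.0]
w_gamma = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
b_beta = [0.0, 0.0, 0.0]

modulated = global_modulation(features, z, w_gamma, b_gamma, w_beta, b_beta)
concatenated = concatenation(features, z)
```

With modulation, the latent rescales and shifts every channel directly; with concatenation, it only enlarges the input and its influence depends on what downstream layers learn.
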
##### Importance of Semantic Warm\-Starts\.

Finally, we evaluate the role of the learned encoder $E_{\phi}$ in the test-time fitting process. We compare two strategies for initializing latent optimization: using the encoder prediction $z_{0}=E_{\phi}(\tilde{\tau})$ (Warm) and using a random Gaussian sample (Rand). Table [6](https://arxiv.org/html/2606.00336#S5.T6) shows that the Warm initialization achieves high success with as few as 10 optimization steps. In contrast, random initialization fails to reach viable strategies even after 100 steps. These results show that our geometry-aligned encoder provides a critical semantic initialization, placing the optimization in a well-conditioned region of the behavior manifold.

Table 6: Effect of initialization strategy & latent optimization steps.
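
One generic reason initialization can matter for gradient-based fitting is that a non-convex loss has several basins, and gradient descent stays in the basin where it starts. The toy below illustrates only that mechanism. The one-dimensional double-well loss is hypothetical and is not the paper's $\mathcal{L}_{\text{fit}}$, the fixed "cold" start stands in for an unlucky random draw, and the example does not reproduce the numbers in Table 6.

```python
def toy_loss(z):
    """Hypothetical double-well loss; the deeper well (z < 0) plays the target mode."""
    return (z * z - 1.0) ** 2 + 0.2 * z


def toy_grad(z):
    """Analytic derivative of toy_loss."""
    return 4.0 * z * (z * z - 1.0) + 0.2


def gradient_descent(z0, eta=0.05, steps=10):
    """Plain gradient descent on the toy loss, starting from z0."""
    z = z0
    for _ in range(steps):
        z -= eta * toy_grad(z)
    return z


z_warm = gradient_descent(-0.8)             # start near the target basin
z_cold = gradient_descent(0.9)              # start in the wrong basin
z_cold_long = gradient_descent(0.9, steps=100)
```

In this toy, ten steps from the warm start are enough to settle in the deeper well, while the cold start remains in the shallower one even with ten times as many steps.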

## 6Conclusion

We propose Parameterized Diffusion Policy, a framework that learns a continuous and geometry\-aligned behavior space for steering diffusion policies\. By structuring a latent manifold so that distances between representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise tool for behavior steering\. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints\. We demonstrate that PDP significantly improves success rates on complex multimodal benchmarks in both simulated and real\-robot experiments, particularly in scenarios requiring the synthesis of entirely new behaviors\.

## Acknowledgments

We thank Scott Niekum for early discussions related to this project and Zilai Zeng for discussions on the real\-robot setup\. This work used computational resources and services provided by the Unity Research Computing Platform\. We also thank the anonymous reviewers for their comments and suggestions, which helped strengthen and clarify this work\.

## Impact Statement

This paper presents work whose goal is to improve the controllability and adaptability of diffusion\-based robot policies under changing task constraints\. By learning a geometry\-aligned latent behavior space, the proposed method provides an explicit interface for adapting to new constraints, which may help make robot policies more reliable and data\-efficient in settings where feasible behaviors change at deployment time\. These benefits should be interpreted within the limits of the method: adaptation relies on a successful demonstration, and the adapted behavior is expected to lie within or near the learned behavior space induced by existing demonstrations\.

## References

- Black et al. (2024) Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. In *The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024*. OpenReview.net, 2024.
- Calinon et al. (2007) Calinon, S., Guenter, F., and Billard, A. On learning, representing, and generalizing a task in a humanoid robot. *IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)*, 2007.
- Chen et al. (2023) Chen, L., Bahl, S., and Pathak, D. Playfusion: Skill acquisition via diffusion from language-annotated play. In *Proceedings of the Conference on Robot Learning (CoRL)*, 2023.
- Chi et al. (2023) Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. In *Proceedings of Robotics: Science and Systems (RSS)*, 2023.
- da Silva et al. (2012) da Silva, B. C., Konidaris, G., and Barto, A. Learning parameterized skills. *arXiv preprint arXiv:1206.6398*, 2012.
- Dalal et al. (2021) Dalal, M., Pathak, D., and Salakhutdinov, R. Accelerating robotic reinforcement learning via parameterized action primitives. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2021.
- Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2021.
- Ding et al. (2024) Ding, S., Hu, K., Zhang, Z., Ren, K., Zhang, W., Yu, J., Wang, J., and Shi, Y. Diffusion-based reinforcement learning via q-weighted variational policy optimization. *Advances in Neural Information Processing Systems*, 37:53945–53968, 2024.
- Ding et al. (2025) Ding, S., Hu, K., Zhong, S., Luo, H., Zhang, W., Wang, J., Wang, J., and Shi, Y. Genpo: Generative diffusion models meet on-policy reinforcement learning. *CoRR*, abs/2505.18763, 2025. doi:10.48550/ARXIV.2505.18763.
- Du & Mordatch (2019) Du, Y. and Mordatch, I. Implicit generation and modeling with energy based models. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2019.
- Du et al. (2023a) Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In *International Conference on Machine Learning (ICML)*, 2023a.
- Du et al. (2023b) Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., and Abbeel, P. Learning universal policies via text-guided video generation. *Advances in neural information processing systems*, 36:9156–9172, 2023b.
- Florence et al. (2022) Florence, P., Lynch, C., Zeng, A., Ramirez, O. A., Wahid, A., Downs, L., Adrianos, A., Hsu, C.-Y., and Chi, C. Implicit behavioral cloning. In *Conference on Robot Learning (CoRL)*, 2022.
- Frans et al. (2018) Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. In *6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings*. OpenReview.net, 2018.
- Fu et al. (2023) Fu, H., Yu, S., Tiwari, S., Littman, M., and Konidaris, G. Meta-learning parameterized skills. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), *International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA*, volume 202 of *Proceedings of Machine Learning Research*, pp. 10461–10481. PMLR, 2023.
- Gupta et al. (2019) Gupta, A., Kumar, V., Lynch, C., Levine, S., and Hausman, K. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. In *Conference on Robot Learning (CoRL)*, 2019.
- Gupta et al. (2025) Gupta, V., Fu, H., Luo, C., Jiang, Y., and Konidaris, G. Learning parameterized skills from demonstrations. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2025.
- Hahm et al. (2024) Hahm, J., Lee, J., Kim, S., and Lee, J. Isometric representation learning for disentangled latent space of diffusion models. In *Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024*. OpenReview.net, 2024.
- Hausknecht & Stone (2016) Hausknecht, M. J. and Stone, P. Deep reinforcement learning in parameterized action space. In Bengio, Y. and LeCun, Y. (eds.), *4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings*, 2016.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. *Advances in Neural Information Processing Systems (NeurIPS)*, 2020.
- Høeg et al. (2025) Høeg, S. H., Du, Y., and Egeland, O. Fast policy synthesis with variable noise diffusion models. In *IEEE International Conference on Robotics and Automation, ICRA 2025, Atlanta, GA, USA, May 19-23, 2025*, pp. 4821–4828. IEEE, 2025. doi:10.1109/ICRA55743.2025.11127858.
- Ivanovic et al. (2020) Ivanovic, B., Leung, K., Schmerling, E., and Pavone, M. Multimodal deep generative models for trajectory prediction: A conditional variational autoencoder approach. *IEEE Robotics and Automation Letters*, 6(2):295–302, 2020.
- Jackson et al. (2024) Jackson, M. T., Matthews, M. T., Lu, C., Ellis, B., Whiteson, S., and Foerster, J. Policy-guided diffusion. *arXiv preprint arXiv:2404.06356*, 2024.
- Janner et al. (2022) Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. In *International Conference on Machine Learning (ICML)*, 2022.
- Jia et al. (2024) Jia, X., Blessing, D., Jiang, X., Reuss, M., Donat, A., Lioutikov, R., and Neumann, G. Towards diverse behaviors: A benchmark for imitation learning with human demonstrations. In *International Conference on Learning Representations (ICLR)*, 2024.
- Kang et al. (2023) Kang, B., Ma, X., Du, C., Pang, T., and Yan, S. Efficient diffusion policies for offline reinforcement learning. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), *Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023*, 2023.
- Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), *Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022*, 2022.
- Konidaris & Barto (2009) Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), *Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada*, pp. 1015–1023. Curran Associates, Inc., 2009.
- Lee et al. (2024) Lee, S., Wang, Y., Etukuru, H., Kim, H. J., Shafiullah, N. M. M., and Pinto, L. Behavior generation with latent actions. In *Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024*. OpenReview.net, 2024.
- Li et al. (2025a) Li, M., Xia, K., Zhang, G., Wang, Z., Tao, G., Pan, S., Zhai, J., and Ma, S. Editor: Effective and interpretable prompt inversion for text-to-image diffusion models. *arXiv preprint arXiv:2506.03067*, 2025a.
- Li et al. (2025b) Li, M., Zhang, R., Wen, Z., Pan, S., da Silva, B. C., Zhai, J., and Ma, S. Promptminer: Black-box prompt stealing against text-to-image generative models via reinforcement learning and fuzz optimization. *arXiv preprint arXiv:2511.22119*, 2025b.
- Li et al. (2024a) Li, S., Krohn, R., Chen, T., Ajay, A., Agrawal, P., and Chalvatzaki, G. Learning multimodal behaviors from scratch with diffusion policy gradient. In Globersons, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. M., and Zhang, C. (eds.), *Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024*, 2024a.
- Li et al. (2024b) Li, S., Krohn, R., Chen, T., Ajay, A., Agrawal, P., and Chalvatzaki, G. Learning multimodal behaviors from scratch with diffusion policy gradient. *Advances in Neural Information Processing Systems*, 37:38456–38479, 2024b.
- Li et al. (2025c) Li, Z., Yang, R., Chen, R., Luo, Z., and Chen, L. Adpro: a test-time adaptive diffusion policy via manifold-constrained denoising and task-aware initialization for robotic manipulation. *arXiv preprint arXiv:2508.06266*, 2025c.
- Lynch et al. (2020) Lynch, C., Florence, P., and et al. Learning latent plans from play. In *Conference on Robot Learning*, 2020.
- Ma et al\. \(2025\)Ma, H\., Nabati, O\., Rosenberg, A\., Dai, B\., Lang, O\., Szpektor, I\., Boutilier, C\., Li, N\., Mannor, S\., Shani, L\., et al\.Reinforcement learning with discrete diffusion policies for combinatorial action spaces\.*arXiv preprint arXiv:2509\.22963*, 2025\.
- Makarova et al\. \(2025\)Makarova, M\., Liu, Q\., and Tsetserukou, D\.Diffusionrl: Efficient training of diffusion policies for robotic grasping using rl\-adapted large\-scale datasets\.*arXiv preprint arXiv:2505\.18876*, 2025\.
- Mandlekar et al\. \(2021\)Mandlekar, A\., Xu, D\., Wong, J\., Nasiriany, S\., Wang, C\., Kulkarni, R\., Fei\-Fei, L\., Savarese, S\., Zhu, Y\., and Fan, L\.What matters in learning from offline demonstrations for robot manipulation\.In*Conference on Robot Learning \(CoRL\)*, 2021\.
- Masson et al\. \(2016\)Masson, W\., Ranchod, P\., and Konidaris, G\. D\.Reinforcement learning with parameterized actions\.In Schuurmans, D\. and Wellman, M\. P\. \(eds\.\),*Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12\-17, 2016, Phoenix, Arizona, USA*, pp\. 1934–1940\. AAAI Press, 2016\.doi:10\.1609/AAAI\.V30I1\.10226\.
- Miao et al\. \(2024\)Miao, Z\., Wang, J\., Wang, Z\., Yang, Z\., Wang, L\., Qiu, Q\., and Liu, Z\.Training diffusion models towards diverse image generation with reinforcement learning\.In*IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16\-22, 2024*, pp\. 10844–10853\. IEEE, 2024\.doi:10\.1109/CVPR52733\.2024\.01031\.
- Moser et al\. \(2025\)Moser, B\. B\., Shanbhag, A\. S\., Raue, F\., Frolov, S\., Palacio, S\., and Dengel, A\.Diffusion models, image super\-resolution, and everything: A survey\.*IEEE Trans\. Neural Networks Learn\. Syst\.*, 36\(7\):11793–11813, 2025\.doi:10\.1109/TNNLS\.2024\.3476671\.
- Nichol & Dhariwal \(2021\)Nichol, A\. and Dhariwal, P\.Improved denoising diffusion probabilistic models\.In*Proceedings of the 38th International Conference on Machine Learning \(ICML\)*, 2021\.
- Pan et al\. \(2025a\)Pan, C\., Anantharaman, G\., Huang, N\.\-C\., Jin, C\., Pfrommer, D\., Yuan, C\., Permenter, F\., Qu, G\., Boffi, N\., Shi, G\., et al\.Much ado about noising: Dispelling the myths of generative robotic control\.*arXiv preprint arXiv:2512\.01809*, 2025a\.
- Pan et al\. \(2025b\)Pan, Y\., Feng, R\., Dai, Q\., Wang, Y\., Lin, W\., Guo, M\., Luo, C\., and Zheng, N\.Semantics lead the way: Harmonizing semantic and texture modeling with asynchronous latent diffusion\.*arXiv preprint arXiv:2512\.04926*, 2025b\.
- Park et al\. \(2023\)Park, Y\.\-H\., Kwon, M\., Jo, J\., and Uh, Y\.Unsupervised discovery of semantic latent directions in diffusion models\.*arXiv preprint arXiv:2302\.01245*, 2023\.
- Perez et al\. \(2018\)Perez, E\., Strub, F\., De Vries, H\., Dumoulin, V\., and Courville, A\.FiLM: Visual reasoning with a general conditioning layer\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 32, 2018\.
- Qiao et al\. \(2025\)Qiao, R\., Cheng, J\., Dai, X\., Tian, Y\., and Lv, Y\.Offline reinforcement learning with discrete diffusion skills\.*arXiv preprint arXiv:2503\.20176*, 2025\.
- Queisser & Steil \(2018\)Queisser, J\. F\. and Steil, J\. J\.Bootstrapping of parameterized skills through hybrid optimization in task and policy spaces\.*Frontiers Robotics AI*, 5:49, 2018\.doi:10\.3389/FROBT\.2018\.00049\.
- Reuss et al\. \(2023\)Reuss, M\., Li, M\., Jia, X\., and Lioutikov, R\.Goal\-conditioned imitation learning using score\-based diffusion policies\.In*Proceedings of Robotics: Science and Systems \(RSS\)*, 2023\.
- Shan et al\. \(2025\)Shan, Z\., Fan, C\., Qiu, S\., Shi, J\., and Bai, C\.Forward KL regularized preference optimization for aligning diffusion policies\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 39, pp\. 14386–14395, 2025\.
- Sohl\-Dickstein et al\. \(2015\)Sohl\-Dickstein, J\., Weiss, E\. A\., Maheswaranathan, N\., and Ganguli, S\.Deep unsupervised learning using nonequilibrium thermodynamics\.In*Proceedings of the 32nd International Conference on Machine Learning \(ICML\)*, 2015\.
- Song & Ermon \(2019\)Song, Y\. and Ermon, S\.Generative modeling by estimating gradients of the data distribution\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2019\.
- Song et al\. \(2021\)Song, Y\., Sohl\-Dickstein, J\., Kingma, D\. P\., Kumar, A\., Ermon, S\., and Poole, B\.Score\-based generative modeling through stochastic differential equations\.In*9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021*\. OpenReview\.net, 2021\.
- Sutton et al\. \(1999\)Sutton, R\. S\., Precup, D\., and Singh, S\.Between MDPs and semi\-MDPs: A framework for temporal abstraction in reinforcement learning\.*Artificial Intelligence*, 1999\.
- Vezhnevets et al\. \(2017\)Vezhnevets, A\. S\., Osindero, S\., Schaul, T\., Heess, N\., Jaderberg, M\., Silver, D\., and Kavukcuoglu, K\.Feudal networks for hierarchical reinforcement learning\.In Precup, D\. and Teh, Y\. W\. \(eds\.\),*Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6\-11 August 2017*, volume 70 of*Proceedings of Machine Learning Research*, pp\. 3540–3549\. PMLR, 2017\.
- Wagenmaker et al\. \(2025\)Wagenmaker, A\., Nakamoto, M\., Zhang, Y\., Park, S\., Yagoub, W\., Nagabandi, A\., Gupta, A\., and Levine, S\.Steering your diffusion policy with latent space reinforcement learning\.*Conference on Robot Learning \(CoRL\)*, 2025\.
- Wang et al\. \(2023\)Wang, Z\., Hunt, J\. J\., and Zhou, M\.Diffusion policies as an expressive policy class for offline reinforcement learning\.In*The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023*\. OpenReview\.net, 2023\.
- Wang et al\. \(2025\)Wang, Z\., Liu, J\., and Pan, L\.Learning intractable multimodal policies with reparameterization and diversity regularization\.*arXiv preprint arXiv:2511\.01374*, 2025\.
- Wolf et al\. \(2025\)Wolf, R\., Shi, Y\., Liu, S\., and Rayyes, R\.Diffusion models for robotic manipulation: A survey\.*Frontiers in Robotics and AI*, 12:1606247, 2025\.
- Wu et al\. \(2025\)Wu, K\., Zhu, Y\., Li, J\., Wen, J\., Liu, N\., Xu, Z\., and Tang, J\.Discrete policy: Learning disentangled action space for multi\-task robotic manipulation\.In*IEEE International Conference on Robotics and Automation, ICRA 2025, Atlanta, GA, USA, May 19\-23, 2025*, pp\. 8811–8818\. IEEE, 2025\.doi:10\.1109/ICRA55743\.2025\.11127630\.
- Xu et al\. \(2025\)Xu, C\., Guo, J\., Liang, Y\., Huang, H\., Zou, H\., Zheng, X\., Yu, S\., Chu, X\., Cao, J\., and Wang, T\.Diffusion models for reinforcement learning: Foundations, taxonomy, and development\.*arXiv preprint arXiv:2510\.12253*, 2025\.
- Yang et al\. \(2024\)Yang, B\., Su, H\., Gkanatsios, N\., Ke, T\.\-W\., Jain, A\., Schneider, J\., and Fragkiadaki, K\.Diffusion\-ES: Gradient\-free planning with diffusion for autonomous driving and zero\-shot instruction following\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*, 2024\.
- Yu et al\. \(2025\)Yu, H\., Jin, Y\., He, Y\., and Sui, W\.Efficient task\-specific conditional diffusion policies: Shortcut model acceleration and SO\(3\) optimization\.In*Proceedings of the Computer Vision and Pattern Recognition Conference*, pp\. 4174–4183, 2025\.
- Zhang et al\. \(2024\)Zhang, R\., Fu, H\., Miao, Y\., and Konidaris, G\.Model\-based reinforcement learning for parameterized action spaces\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, 2024\.
- Zheng et al\. \(2024\)Zheng, R\., Cheng, C\.\-A\., Daumé III, H\., Huang, F\., and Kolobov, A\.PRISE: LLM\-style sequence compression for learning temporal action abstractions in control\.In*Forty\-first International Conference on Machine Learning*, 2024\.
- Zhu et al\. \(2024\)Zhu, Y\., Xie, J\., Wu, Y\. N\., and Gao, R\.Learning energy\-based models by cooperative diffusion recovery likelihood\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net, 2024\.
- Zhu et al\. \(2023\)Zhu, Z\., Zhao, H\., He, H\., Zhong, Y\., Zhang, S\., Yu, Y\., and Zhang, W\.Diffusion models for reinforcement learning: A survey\.*arXiv preprint arXiv:2311\.01223*, 2023\.

## Appendix A Algorithms and Model Architecture

### A.1 DTW and Soft-DTW Formulation

To establish a geometry-aligned behavior space, we require a distance metric $\delta^{X}$ that is invariant to temporal variations and execution speeds. We use Dynamic Time Warping (DTW) and its differentiable counterpart, Soft-DTW, to compare behavior traces $X(\tau_i)$ and $X(\tau_j)$. Let these traces be represented as sequences $A=(a_1,\dots,a_n)$ and $B=(b_1,\dots,b_m)$.

**Dynamic Time Warping (DTW):** DTW seeks an optimal alignment between $A$ and $B$ by minimizing the cumulative cost over a cost matrix $D\in\mathbb{R}^{n\times m}$, where $D_{i,j}=\Delta(a_i,b_j)$ represents the local distance between points. An alignment is defined by a binary warping path (alignment matrix) $\mathbf{A}\in\{0,1\}^{n\times m}$, where $\mathbf{A}_{i,j}=1$ if $a_i$ is matched to $b_j$. Every path in the set of valid paths $\mathcal{A}_{n,m}$ must satisfy boundary, continuity, and monotonicity constraints. The standard DTW distance is defined as:

$$\mathrm{DTW}(A,B)=\min_{\mathbf{A}\in\mathcal{A}_{n,m}}\langle\mathbf{A},D\rangle. \tag{13}$$

This is computed efficiently via dynamic programming with the recurrence:

$$r_{i,j}=D_{i,j}+\min\{r_{i-1,j},\,r_{i,j-1},\,r_{i-1,j-1}\}. \tag{14}$$
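As a minimal sketch of the recurrence in Eq. (14), the following Python function fills the cumulative-cost table for two scalar sequences. The squared difference used as the local distance $\Delta$, the function name `dtw`, and the example sequences are illustrative assumptions rather than details taken from the paper.

```python
import math

def dtw(a, b, dist=lambda x, y: (x - y) ** 2):
    """Hard DTW between two sequences via the recurrence in Eq. (14).

    `dist` plays the role of the local distance Delta; squared difference
    is an assumption made for this sketch.
    """
    n, m = len(a), len(b)
    # r[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            r[i][j] = cost + min(r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
    return r[n][m]

# Same path executed at two speeds, plus a spatially shifted variant.
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
fast = [0.0, 1.0, 2.0]
shifted = [1.0, 2.0, 3.0]

d_speed = dtw(slow, fast)      # differs only in execution speed
d_shift = dtw(fast, shifted)   # differs in geometry
```

In this toy case the speed-only variation is absorbed by the warping path while the shifted trace is not, which is the invariance the section asks of $\delta^{X}$.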
**Soft-DTW Formulation:** The $\min$ operator in standard DTW is non-differentiable, preventing backpropagation through the geometry loss $\mathcal{L}_{geo}$. Following Cuturi & Blondel (2017), we replace the hard $\min$ with a smoothed $\text{softmin}_{\gamma}$ operator with a parameter $\gamma>0$:

$$\mathrm{softDTW}_{\gamma}(A,B)=-\gamma\log\sum_{\mathbf{A}\in\mathcal{A}_{n,m}}\exp\left(-\frac{\langle\mathbf{A},D\rangle}{\gamma}\right). \tag{15}$$

As $\gamma\to 0$, $\mathrm{softDTW}_{\gamma}$ converges to the standard DTW distance. The value can be computed in $O(nm)$ time using the following DP recurrence:

$$r_{i,j}=D_{i,j}+\text{softmin}_{\gamma}\{r_{i-1,j},\,r_{i,j-1},\,r_{i-1,j-1}\}, \tag{16}$$

where $\text{softmin}_{\gamma}(x_1,\dots,x_k)=-\gamma\log\sum_{i=1}^{k}\exp(-x_i/\gamma)$.

**Gradient Computation:** The primary advantage of Soft-DTW is its differentiability with respect to the input sequences. The gradient with respect to the cost matrix $D$ (and thus the behavior latent $z$ via the trajectory features) is given by:

$$\nabla_{D}\,\mathrm{softDTW}_{\gamma}(A,B)=E=[e_{i,j}], \tag{17}$$

where $E$ is the expected alignment matrix under the Gibbs distribution $P(\mathbf{A})\propto\exp(-\langle\mathbf{A},D\rangle/\gamma)$. This allows the PDP framework to anchor the Euclidean geometry of the behavior manifold $z$ to the physical similarity of trajectories by minimizing the alignment-based distance $\delta_{ij}^{X}$.

**Algorithm 1** Soft-DTW Forward Pass

1. **Input:** Cost matrix $D\in\mathbb{R}^{n\times m}$, smoothing parameter $\gamma$.
2. $R\in\mathbb{R}^{(n+1)\times(m+1)}\leftarrow\infty$
3. $R_{0,0}\leftarrow 0$
4. **for** $i=1$ **to** $n$ **do**
5. **for** $j=1$ **to** $m$ **do**
6. $R_{i,j}\leftarrow D_{i,j}+\text{softmin}_{\gamma}\{R_{i-1,j},\,R_{i,j-1},\,R_{i-1,j-1}\}$
7. **end for**
8. **end for**
9. **Output:** $R_{n,m}$
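The forward pass of Algorithm 1 and the gradient in Eq. (17) can be sketched in plain Python as follows. The backward recursion is the standard one from Cuturi & Blondel (2017), which the paper cites; whether the authors use this exact implementation is not stated. The function names, the squared-difference cost, and the example sequences are assumptions made for illustration.

```python
import math

def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    lo = min(values)
    if math.isinf(lo):  # every neighbour is unreachable
        return math.inf
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in values))

def soft_dtw(D, gamma):
    """Forward pass (Algorithm 1): returns (soft-DTW value, full table R)."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]]
            R[i][j] = D[i - 1][j - 1] + softmin(prev, gamma)
    return R[n][m], R

def soft_dtw_grad(D, gamma):
    """Backward pass: returns (value, E) with E[i][j] = d value / d D[i][j]."""
    n, m = len(D), len(D[0])
    value, R = soft_dtw(D, gamma)
    # Pad D with zeros and R with -inf so out-of-range neighbours contribute 0.
    Dp = [[0.0] * (m + 2) for _ in range(n + 2)]
    Rp = [[-math.inf] * (m + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            Dp[i][j] = D[i - 1][j - 1]
            Rp[i][j] = R[i][j]
    Rp[n + 1][m + 1] = R[n][m]
    E = [[0.0] * (m + 2) for _ in range(n + 2)]
    E[n + 1][m + 1] = 1.0
    for j in range(m, 0, -1):
        for i in range(n, 0, -1):
            a = math.exp((Rp[i + 1][j] - Rp[i][j] - Dp[i + 1][j]) / gamma)
            b = math.exp((Rp[i][j + 1] - Rp[i][j] - Dp[i][j + 1]) / gamma)
            c = math.exp((Rp[i + 1][j + 1] - Rp[i][j] - Dp[i + 1][j + 1]) / gamma)
            E[i][j] = E[i + 1][j] * a + E[i][j + 1] * b + E[i + 1][j + 1] * c
    return value, [row[1:m + 1] for row in E[1:n + 1]]

# Illustrative cost matrix from two short scalar traces (squared difference).
a_seq, b_seq = [0.0, 1.0, 2.0], [0.0, 1.0]
D = [[(x - y) ** 2 for y in b_seq] for x in a_seq]
value, E = soft_dtw_grad(D, gamma=1.0)
```

Each entry of `E` is the probability, under the Gibbs distribution over alignments, that the corresponding pair of points is matched, so the corner entries equal 1 because every valid path passes through them.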

### A.2 Joint Training of the Behavior Latent and Policy

The training of PDP requires balancing two distinct objectives: establishing a structured, geometry-aligned latent manifold $z$ and learning a denoiser $\epsilon_{\theta}$ that can accurately map these latents to high-dimensional action sequences. A naive joint optimization could allow the diffusion loss, which is primarily a mode-fitting objective, to distort the metric-preserving properties of the latent space enforced by the soft-DTW geometry loss $\mathcal{L}_{geo}$.

To prevent this representation collapse, we use an alternating optimization scheme. In the first phase, we refine the trajectory encoder $E_{\phi}$ and symmetric decoder $D_{\psi}$ to ensure that the latent space $z$ preserves behavioral information and captures semantic similarities between physical trajectories. In the second phase, we optimize the denoiser parameters $\theta$ by conditioning on the latent $z$ produced by the encoder. Crucially, we apply a stop-gradient operator $sg(\cdot)$ to the behavior latent during this phase, ensuring $z$ remains a fixed, reliable control handle for the diffusion process. We provide the detailed procedural breakdown of this joint training scheme in Algorithm [2](https://arxiv.org/html/2606.00336#alg2).

**Algorithm 2** Joint Training of Parameterized Diffusion Policy (PDP)

1. **Initialize:** Encoder $E_{\phi}$, Decoder $D_{\psi}$, and Denoiser $\epsilon_{\theta}$.
2. **while** not converged **do**
3. Sample batch of demonstrations $\{\tau_i,\tau_j\}$ from dataset.
4. **Step 1: Embedding Refinement**
5. Compute physical distances $\delta_{ij}^{X}=\mathrm{softDTW}_{\gamma}(X(\tau_i),X(\tau_j))$.
6. Map to latents $z=\mu_{\phi}(\tau)+\sigma_{\phi}(\tau)\odot\xi$ via trajectory encoder.
7. Update $(\phi,\psi)$ via $\nabla_{\phi,\psi}(\mathcal{L}_{rec}+\beta_{KL}\mathcal{L}_{KL}+\beta_{geo}\mathcal{L}_{geo})$.
8. **Step 2: Denoiser Optimization**
9. Compute fixed behavior latent $\bar{z}=sg(\mu_{\phi}(\tau))$.
10. Update $\theta$ via $\nabla_{\theta}\mathbb{E}_{\tau,t,k,\epsilon}\left[\|\epsilon-\epsilon_{\theta}(A_t^k,k,c_t,\bar{z})\|_2^2\right]$.
11. **end while**

### A.3 Test-Time Latent Adaptation via Gradient Descent

A core advantage of PDP is its ability to perform fast, data-light adaptation to novel environmental constraints without updating millions of network weights. Standard diffusion policies often rely on stochastic sampling to discover feasible behaviors, which is ill-conditioned under significant constraint shifts where successful trajectories occupy a small fraction of the total noise space. In contrast, PDP treats the behavior manifold as a structured search space, replacing “luck-based” sampling with stable, gradient-based optimization in $z$.

The adaptation process begins with a “semantic warm-start,” where we use the frozen encoder $E_{\phi}$ to map a target demonstration $\tilde{\tau}$ (or a partial behavior trace) to an initial latent code $z_0$. This places the optimization in the correct region of the behavior manifold, significantly improving convergence stability compared to random initialization. We then iteratively refine this latent representation by minimizing a diffusion-consistent fitting objective $\mathcal{L}_{fit}$, which identifies the behavior mode that maximizes the likelihood of the target actions under the frozen denoiser. The complete test-time latent adaptation procedure is formalized in Algorithm [3](https://arxiv.org/html/2606.00336#alg3).

**Algorithm 3** Test-Time Latent Adaptation

1. **Input:** Target demonstration $\tilde{\tau}$, frozen models $\{E_{\phi},\epsilon_{\theta}\}$.
2. **Initialization:** $z_0=\mu_{\phi}(X(\tilde{\tau}))$ {Semantic Warm-start}
3. **for** step $s=1$ **to** $S$ **do**
4. Sample target action chunks $\tilde{A}_t$ and context $\tilde{c}_t$.
5. Compute $\mathcal{L}_{fit}(z)=\mathbb{E}_{k,\epsilon}\left[\|\epsilon-\epsilon_{\theta}(\tilde{A}_t^k,k,\tilde{c}_t,z)\|_2^2\right]+\lambda\|z\|_2^2$.
6. Update latent: $z\leftarrow z-\eta\nabla_{z}\mathcal{L}_{fit}(z)$.
7. **end for**
8. **Output:** Optimized $z^{*}$ for execution rollouts.
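The loop in Algorithm 3 can be illustrated with a toy example in which only the latent is updated while the denoiser stays frozen. Everything concrete here is a hypothetical stand-in: the linear map `W` from latent to clean action chunk, the closed-form `denoiser`, the three-level schedule `ALPHA_BAR`, the hand-picked warm start, and the use of central finite differences in place of automatic differentiation. It is a sketch of the procedure under those assumptions, not the authors' implementation.

```python
import math
import random

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical map: 2-D latent -> 3-D action chunk
ALPHA_BAR = [0.8, 0.5, 0.2]               # hypothetical cumulative noise schedule

def mode_mean(z):
    """Clean action chunk that the toy policy associates with latent z."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]

def denoiser(a_noisy, k, z):
    """Frozen toy denoiser: exact noise predictor if the clean chunk is mode_mean(z)."""
    ab = ALPHA_BAR[k]
    return [(x - math.sqrt(ab) * mu) / math.sqrt(1.0 - ab)
            for x, mu in zip(a_noisy, mode_mean(z))]

def fit_loss(z, a_target, samples, lam):
    """Monte Carlo estimate of L_fit(z): noise-prediction error plus lam * ||z||^2."""
    total = 0.0
    for k, eps in samples:
        ab = ALPHA_BAR[k]
        a_noisy = [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e
                   for a, e in zip(a_target, eps)]
        pred = denoiser(a_noisy, k, z)
        total += sum((e - p) ** 2 for e, p in zip(eps, pred))
    return total / len(samples) + lam * sum(zi * zi for zi in z)

def adapt_latent(z0, a_target, steps=200, lr=0.05, lam=1e-3, seed=0):
    """Gradient descent on z only; the denoiser is never modified."""
    rng = random.Random(seed)
    z = list(z0)
    for _ in range(steps):
        # Sample diffusion steps k and noise eps for this update.
        samples = [(rng.randrange(len(ALPHA_BAR)),
                    [rng.gauss(0.0, 1.0) for _ in a_target]) for _ in range(8)]
        grad = []
        for d in range(len(z)):  # central finite differences stand in for autodiff
            zp, zm = list(z), list(z)
            zp[d] += 1e-5
            zm[d] -= 1e-5
            grad.append((fit_loss(zp, a_target, samples, lam)
                         - fit_loss(zm, a_target, samples, lam)) / 2e-5)
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z

z_true = [0.5, -1.0]              # latent that generated the target chunk
a_target = mode_mean(z_true)
z_warm = [0.3, -0.7]              # stands in for the encoder's warm start
z_star = adapt_latent(z_warm, a_target)
```

In this toy setting the optimized latent ends up close to the one that generated the target chunk, with a small shrinkage toward the origin caused by the quadratic penalty.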

### A.4 Analysis of Controllability and Generalization

This section provides an analysis of why PDP improves over standard DP in the constraint-induced behavior shift setting studied in this paper (Fig. [1](https://arxiv.org/html/2606.00336#S1.F1)). The key distinction is between (i) *representing* multimodal behaviors in offline demonstrations, and (ii) *steering* a policy toward a specific feasible strategy when constraints shift. We show that DP addresses (i) better than regression-based BC, but its default control interface (high-dimensional sampling noise) is ill-conditioned for (ii). PDP resolves this by exposing a low-dimensional behavior latent $z$ whose geometry is explicitly aligned with physical trajectory similarity via soft-DTW, making both training and test-time steering well-conditioned.

##### Diffusion policies can represent multimodal behaviors while regression BC collapses them\.

Consider a context $c_t$ (e.g., a history encoder output) and an action chunk $A_t$ of horizon $H$. In our *True multimodal* datasets, the expert conditional distribution is inherently multi-valued:

$$p_{\text{train}}(A_t\mid c_t)=\sum_{m=1}^{M}p(m\mid c_t)\,p(A_t\mid c_t,m), \tag{18}$$

where $m$ indexes distinct trajectory modes (e.g., different approach paths or grasp affordances). A standard regression BC objective (e.g., squared error on action chunks) learns a deterministic predictor $\pi_{\theta}(c_t)$ that minimizes $\mathbb{E}[\|A_t-\pi_{\theta}(c_t)\|_2^2]$. The population minimizer is the conditional mean $\pi^{*}(c_t)=\mathbb{E}[A_t\mid c_t]$, which averages across modes in Eq. ([18](https://arxiv.org/html/2606.00336#A1.E18)). When modes are geometrically incompatible, this mean corresponds to an “in-between” trajectory that may be physically infeasible. In contrast, diffusion policies model a *conditional distribution* over chunks and can sample distinct behaviors from different noise seeds, enabling multi-strategy rollouts when the data are truly multimodal.
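The mode-averaging effect can be seen in a one-dimensional toy case. The two modes, their probabilities, and the candidate grid below are illustrative assumptions: an expert passes an obstacle on the left (action -1) or on the right (action +1) with equal probability, and a deterministic predictor is scored by expected squared error.

```python
# Two equally likely expert modes for a scalar "action chunk".
modes = [-1.0, 1.0]   # pass left / pass right
probs = [0.5, 0.5]

def mse(pred):
    """Expected squared error of a deterministic prediction under the mixture."""
    return sum(p * (a - pred) ** 2 for p, a in zip(probs, modes))

# Conditional mean of the mixture.
cond_mean = sum(p * a for p, a in zip(probs, modes))

# Brute-force search over candidate deterministic predictions in [-2, 2].
candidates = [i / 10.0 for i in range(-20, 21)]
best = min(candidates, key=mse)
```

Here the squared-error minimizer coincides with the conditional mean, which lies between the two modes and matches neither demonstrated strategy (in this toy scene it would drive straight at the obstacle).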

At the same time, a recent analysis of generative control policies (Pan et al., [2025a](https://arxiv.org/html/2606.00336#bib.bib43)) emphasizes that on many popular BC benchmarks, performance gains often arise not from learned multi-modality per se, but from the combination of stochasticity injection during training and supervised iterative computation during inference. Our setting is complementary: we intentionally construct benchmarks where multiple distinct strategies exist under near-identical initial conditions, so explicit mode handling is essential. We further require a *controllable* interface to reliably select or synthesize feasible strategies under constraint shifts.

##### Mode disambiguation yields training relief\.

DP trains a denoiser to predict diffusion noise under the forward process

$$A_t^k=\sqrt{\bar{\alpha}_k}\,A_t+\sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad\epsilon\sim\mathcal{N}(0,I), \tag{19}$$

using the MSE noise-prediction loss in Eq. ([9](https://arxiv.org/html/2606.00336#S4.E9)). For notational brevity, let $x\triangleq(A_t^k,k,c_t)$. Under squared loss, the Bayes-optimal predictor is the conditional mean $\epsilon^{*}(x)=\mathbb{E}[\epsilon\mid x]$, and the minimum achievable risk is the expected conditional variance $\mathbb{E}[\mathrm{Var}(\epsilon\mid x)]$. When $p(A_t\mid c_t)$ is multimodal, $x$ alone does not identify the underlying mode $m$, so $\epsilon$ remains highly uncertain given $x$, leading to a harder regression problem and unstable “averaged” denoising directions.

PDP introduces a behavior latent $z$ to separate *mode selection* from *within-mode denoising*. Writing the conditional risk with $z$ gives the variance decomposition (law of total variance):

$$\mathrm{Var}(\epsilon\mid x)=\mathbb{E}_{z}\!\left[\mathrm{Var}(\epsilon\mid x,z)\right]+\mathrm{Var}_{z}\!\left(\mathbb{E}[\epsilon\mid x,z]\right). \tag{20}$$

The first term $\mathbb{E}_{z}[\mathrm{Var}(\epsilon\mid x,z)]$ represents *residual within-mode uncertainty* (intra-mode noise, partial observability, and diffusion corruption) after a behavior is specified. The second term,

$$\Delta_{\text{mode}}(x)\triangleq\mathrm{Var}_{z}\!\left(\mathbb{E}[\epsilon\mid x,z]\right), \tag{21}$$

is the *structural between-mode variance*: it measures how much the optimal denoising target changes when we vary the desired behavior. This is precisely the “variance gap” paid by an unconditioned DP when it must implicitly average over multiple incompatible strategies.
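A small numerical check makes the decomposition in Eq. (20) concrete. The setup is an assumption chosen for illustration: a scalar noise target, two equally likely latents standing for two modes, and two equally likely target values within each mode.

```python
def mean(dist):
    """Mean of a discrete distribution given as (probability, value) pairs."""
    return sum(p * v for p, v in dist)

def var(dist):
    """Variance of a discrete distribution given as (probability, value) pairs."""
    mu = mean(dist)
    return sum(p * (v - mu) ** 2 for p, v in dist)

p_z = [0.5, 0.5]                       # two latents, one per mode
eps_given_z = [
    [(0.5, -2.2), (0.5, -1.8)],        # mode 0: targets clustered near -2
    [(0.5, 1.8), (0.5, 2.2)],          # mode 1: targets clustered near +2
]

# First term of Eq. (20): residual within-mode uncertainty.
within = sum(pz * var(d) for pz, d in zip(p_z, eps_given_z))

# Second term, Delta_mode in Eq. (21): variance of the per-mode means.
between = var([(pz, mean(d)) for pz, d in zip(p_z, eps_given_z)])

# Left-hand side: variance of the target when z is not observed.
marginal = [(pz * p, v) for pz, d in zip(p_z, eps_given_z) for p, v in d]
total = var(marginal)
```

With well-separated modes the between-mode term dominates the total, which is the part of the regression risk that conditioning on $z$ removes in this toy case.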

Importantly, in PDP this decomposition is not merely formal: $z$ is trained to have a specific semantic meaning. Our encoder is optimized with a geometry loss (Eq. ([7](https://arxiv.org/html/2606.00336#S4.E7))) enforcing

$$\|\bar{z}(\tau_i)-\bar{z}(\tau_j)\|_2\approx\kappa\,d_{\text{softDTW}}(X(\tau_i),X(\tau_j)), \tag{22}$$

where $X(\tau)$ is a behavior trace capturing the physical trajectory geometry. Since distinct modes in our datasets correspond to large changes in the trajectory trace (e.g., different approach paths or contact patterns), this alignment organizes $z$ into compact, well-separated regions that correlate with mode identity. Consequently, conditioning on $z$ makes $p(m\mid x,z)$ sharply peaked (Dirac-like in the idealized limit), so the denoiser learns a simpler *mode-specific* denoising problem rather than a global mixture. Empirically, this improved conditioning of the training process is reflected as a large reduction in diffusion training loss for latent-conditioned models versus unconditioned or weakly conditioned baselines (Sec. [5.3](https://arxiv.org/html/2606.00336#S5.SS3)), and yields substantially improved reliability in high-mode tasks (Table [1](https://arxiv.org/html/2606.00336#S5.T1)).

A further interpretation follows from rearranging Eq. ([19](https://arxiv.org/html/2606.00336#A1.E19)):

$$\epsilon=\frac{A_t^k-\sqrt{\bar{\alpha}_k}\,A_t}{\sqrt{1-\bar{\alpha}_k}}. \tag{23}$$

Conditioned on $(A_t^k,k,c_t)$, uncertainty in $\epsilon$ is induced by uncertainty in the clean chunk $A_t$. Since Eq. ([23](https://arxiv.org/html/2606.00336#A1.E23)) is affine in $A_t$, we have that

$$\mathrm{Var}(\epsilon\mid x)=\frac{\bar{\alpha}_k}{1-\bar{\alpha}_k}\,\mathrm{Var}(A_t\mid x),\qquad\mathrm{Var}(\epsilon\mid x,z)=\frac{\bar{\alpha}_k}{1-\bar{\alpha}_k}\,\mathrm{Var}(A_t\mid x,z). \tag{24}$$

Thus, $\Delta_{\text{mode}}(x)$ in Eq. ([21](https://arxiv.org/html/2606.00336#A1.E21)) directly corresponds (up to the same diffusion-dependent scaling) to variance in the conditional mean of the clean action chunk across behavior latents, i.e., to the separation between strategy manifolds. PDP reduces this structural variance by making $z$ an explicit index of the trajectory geometry.
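The scaling in Eq. (24) can be checked numerically for a scalar action. The noise level, the observed noisy value, and the two-point posterior over the clean action below are hypothetical numbers picked only to exercise the identity.

```python
import math

alpha_bar = 0.6    # hypothetical cumulative noise level at step k
a_noisy = 0.3      # observed noisy action A_t^k, held fixed by conditioning on x

# Hypothetical posterior over the clean action given x: (probability, value).
posterior = [(0.7, -1.0), (0.3, 1.5)]

def var(dist):
    """Variance of a discrete distribution given as (probability, value) pairs."""
    mu = sum(p * v for p, v in dist)
    return sum(p * (v - mu) ** 2 for p, v in dist)

# Push each clean action through Eq. (23) to get the implied noise value.
eps_dist = [(p, (a_noisy - math.sqrt(alpha_bar) * a) / math.sqrt(1.0 - alpha_bar))
            for p, a in posterior]

lhs = var(eps_dist)                                    # Var(eps | x)
rhs = alpha_bar / (1.0 - alpha_bar) * var(posterior)   # scaled Var(A_t | x)
```

Because the noisy action is fixed once $x$ is given, it shifts every implied noise value by the same amount and drops out of the variance, leaving only the squared coefficient on the clean action.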

##### Low\-dimensional steering enables controllable adaptation under constraint shifts\.

While DP can represent multimodality, its default control interface for selecting behaviors is the sampling noise $\epsilon\in\mathbb{R}^{Hd_a}$ (or equivalently the initial latent of the reverse process). For multimodal tasks, each strategy corresponds to a basin of attraction in this high-dimensional space. Under constraint-induced shifts, feasible behaviors may occupy a small and hard-to-reach subset of noise space, making both random sampling and direct noise-space optimization ill-conditioned.

PDP replaces this implicit interface with an explicit low-dimensional control handle $z\in\mathbb{R}^{d_z}$, where $d_z\ll Hd_a$, and where distances in $z$ are aligned with physical trajectory similarity (Eq. ([22](https://arxiv.org/html/2606.00336#A1.E22))). At test time, we fit $z$ by minimizing the diffusion-consistent objective in Eq. ([10](https://arxiv.org/html/2606.00336#S4.E10)):

$$\mathcal{L}_{fit}(z;\tilde{\tau})=\mathbb{E}_{t,k,\epsilon}\!\left[\|\epsilon-\epsilon_{\theta}(\tilde{A}_t^k,k,\tilde{c}_t,z)\|_2^2\right]+\lambda\|z\|_2^2, \tag{25}$$

initialized from a *semantic warm-start* $z_0=\mu_{\phi}(X(\tilde{\tau}))$. The first term identifies the behavior latent under which the frozen denoiser assigns high likelihood to the demonstrated actions, while the quadratic penalty is consistent with the Gaussian prior induced by the KL term in Eq. ([5](https://arxiv.org/html/2606.00336#S4.E5)) and keeps the optimization within the support of the learned manifold. Because $z$ is a geometry-aligned coordinate chart, gradient descent in $z$ corresponds to coherent movement in trajectory space: small changes in $z$ induce small, predictable changes in the generated behavior. This enables both (i) reliable selection of the unique feasible training mode (Scenes 1–2), and (ii) navigation to new regions of the manifold to discover novel strategies when no training mode is feasible (Scene 3), without updating policy weights.

##### Generalization as navigable interpolation in behavior space\.

Finally, Eq. ([22](https://arxiv.org/html/2606.00336#A1.E22)) provides a principled explanation for why interpolation in $z$ yields semantically meaningful behavior interpolation. If $z(\lambda)=(1-\lambda)z_i+\lambda z_j$ lies between two mode regions, then its Euclidean position corresponds to an intermediate soft-DTW distance in the behavior-trace metric, so the policy is biased toward synthesizing trajectories that are physically intermediate between the two endpoint strategies. This effect is visible in our latent traversal visualizations (Fig. [5](https://arxiv.org/html/2606.00336#S5.F5)): in an isometric manifold, midpoint latents generate smooth intermediate paths, whereas an unaligned latent collapses to unrelated modes.
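The geometric part of this argument can be sketched in a few lines. The endpoint latents and the scale `kappa` are hypothetical values, and the conversion from latent distance to trace distance holds only under the idealized assumption that Eq. (22) is satisfied exactly.

```python
import math

def lerp(z_i, z_j, lam):
    """Linear interpolation z(lam) = (1 - lam) * z_i + lam * z_j."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(z_i, z_j)]

def dist(u, v):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

z_i, z_j = [0.0, 0.0], [2.0, 1.0]   # hypothetical latents of two strategies
kappa = 2.0                          # hypothetical scale from Eq. (22)

# Five evenly spaced latents from z_i to z_j.
path = [lerp(z_i, z_j, t / 4.0) for t in range(5)]

# Latent distance from the first endpoint grows linearly along the path.
latent_d = [dist(z, z_i) for z in path]

# Trace-space distance implied by an exactly isometric manifold.
implied_trace_d = [d / kappa for d in latent_d]
```

The midpoint latent sits at half the endpoint separation, so under exact alignment the corresponding behavior would sit at roughly half the soft-DTW distance between the two endpoint strategies.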

Overall, PDP improves controllability and generalization by (i) providing *training relief* through mode disambiguation, reducing the inter-mode variance term $\Delta_{\text{mode}}(x)$ that otherwise burdens the denoiser, and (ii) providing *low-dimensional steering* through a geometry-aligned latent $z$ that converts diffusion from noise-driven diversity into a stable and optimizable mechanism for behavior selection, interpolation, and rapid adaptation under constraint-induced behavior shifts.

## Appendix B Domain Specifications and Implementation Details

### B.1 Manipulation Task Descriptions and Multimodal Dataset Collection

#### B.1.1 Simulation Setup

To evaluate the efficacy of PDP in handling high-dimensional, multimodal robotic control, we use a benchmark suite of manipulation domains from RLBench, including CloseDrawer, OpenDrawer, PickUpCup, PlaceBlock, and MeatOffGrill. RLBench is a standardized robotic manipulation benchmark built on top of a physics-based simulator with a fixed robot embodiment and a large collection of goal-conditioned tasks. Beyond RLBench, we further evaluate on four other domains drawn from three complementary benchmarks (Figure [6](https://arxiv.org/html/2606.00336#A2.F6)): OpenDoor from robomimic (Mandlekar et al., [2021](https://arxiv.org/html/2606.00336#bib.bib38)), OpenMicrowave from the Franka Kitchen environment (Gupta et al., [2019](https://arxiv.org/html/2606.00336#bib.bib16)), and Avoiding24 and Avoiding32 from the D3IL benchmark (Jia et al., [2024](https://arxiv.org/html/2606.00336#bib.bib25)). Spanning distinct simulation engines, robot embodiments, and task structures, these benchmarks together provide a comprehensive testbed for multimodal behavior learning and adaptation. Each task is formulated as a finite-horizon Markov Decision Process (MDP) defined by $(\mathcal{S},\mathcal{A},\mathcal{P},r,\gamma)$, where the state $s_t\in\mathcal{S}$ encodes the instantaneous configuration of the robot and its environment, including task-relevant object poses and robot kinematics, and actions correspond to continuous end-effector control executed through stochastic transition dynamics $\mathcal{P}$. For all domains, we introduce targeted modifications to the original task setups to make them better suited for controlled multimodal data collection, including the insertion of obstacles between the gripper and task-relevant objects or targets, as well as the restriction of feasible grasping affordances. These modifications are designed to induce multiple distinct but valid strategies under identical initial conditions. We provide detailed descriptions of all task setups below.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x6.png)

Figure 6: The four other manipulation domains we evaluate on, aside from those shown in the main text: OpenDoor (robomimic), OpenMicrowave (Franka Kitchen), and Avoiding24 and Avoiding32 (D3IL).

- **CloseDrawer.** The objective of CloseDrawer is to close an open drawer by pushing its handle to a target closed configuration. To induce multimodality, we introduce a static obstacle positioned between the gripper and the drawer handle, blocking direct straight-line access. As a result, the robot must execute a curved approach trajectory that circumvents the obstacle before making contact with the handle and applying the closing motion. Multiple feasible strategies arise depending on how the gripper navigates around the obstacle. The episode is considered successful when the drawer reaches the closed threshold without contacting the obstacle; any collision with the obstacle or failure to close the drawer results in episode failure.
- **PickUpCup.** In PickUpCup, the robot is required to grasp and lift a cup from a table. Rather than inducing multimodality through reaching trajectories, this task emphasizes multimodality in grasp selection. We manually define four discrete grasping affordances located at angular positions of $0^{\circ}$, $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$ around the rim of the cup, corresponding to four distinct grasping strategies. The gripper is restricted to grasping at these predefined locations; grasps at any other position are considered invalid. Since the default RLBench cup does not provide such discrete handles, we explicitly specify these grasping points to simulate the intended affordance structure. An episode succeeds only if the cup is grasped at one of the valid locations and lifted to the target height; grasping at an incorrect location or dropping the cup leads to failure.
- **PlaceBlock.** The PlaceBlock task requires the robot to pick up a block and place it at a designated target location. To promote multimodal behavior, we insert two obstacles into the scene: one between the gripper and the block, and another between the block’s initial position and the target placement region. This setup forces the robot to choose among multiple distinct reaching paths to grasp the block, as well as different transport trajectories to carry it to the target while avoiding collisions. The episode is successful if the block is placed within the target region without contacting any obstacle; collisions during either the reaching or placing phase result in failure.
- **MeatOffGrill.** In MeatOffGrill, the robot must remove a piece of meat from a grill and place it at a target location. Similar to PlaceBlock, we introduce two obstacles: one positioned between the gripper and the meat to constrain the approach trajectory, and another between the meat and the target region to constrain the carrying trajectory. This creates a combinatorial set of feasible strategies arising from different ways of approaching the meat and transporting it to the target. The task is considered successful when the meat is fully removed from the grill and placed in the target region without contacting any obstacle; touching an obstacle at any stage causes the episode to fail.
- **OpenDoor.** Adapted from the robomimic benchmark (Mandlekar et al., [2021](https://arxiv.org/html/2606.00336#bib.bib38)), OpenDoor requires the robot to reach the handle of a door, turn the latch, and swing the door open to a target angle. Multimodality arises from the multiple feasible ways the end-effector can approach and engage the handle before pulling, producing distinct but equally valid opening trajectories under identical initial conditions. An episode is successful when the door is opened beyond the target angle within the horizon; failing to engage the handle or to open the door results in failure.
- **OpenMicrowave.** Built on the Franka Kitchen environment (Gupta et al., [2019](https://arxiv.org/html/2606.00336#bib.bib16)), OpenMicrowave tasks a 9-DoF Franka arm with pulling open the microwave door in a cluttered kitchen scene. Because the handle can be reached and opened along several distinct approach paths among the surrounding kitchen fixtures, the task admits multiple valid execution modes. The episode is considered successful when the microwave door is opened past the target threshold.
- **Avoiding24 and Avoiding32.** Drawn from the D3IL benchmark (Jia et al., [2024](https://arxiv.org/html/2606.00336#bib.bib25)), the Avoiding tasks require the end-effector to travel from a fixed start location to a goal on the far side of a field of static obstacles. At each obstacle the agent may pass on either side, so the set of collision-free routes defines a large family of execution modes: 24 distinct modes in Avoiding24 and 32 in Avoiding32. An episode is successful if the end-effector reaches the goal without colliding with any obstacle; any collision terminates the episode as a failure.

##### Multimodal Dataset Collection\.

For each task, we explicitly construct a set of discrete execution modes to induce structured inter\-mode diversity, and collect multiple noisy intra\-mode demonstration variants for each mode to capture execution\-level variability\. Each mode corresponds to a distinct high\-level strategy defined by geometric constraints in the environment \(e\.g\., obstacle avoidance direction or grasping affordance\), while demonstrations within the same mode differ due to stochasticity in execution and minor trajectory variations\. This design yields datasets that exhibit both clearly separable inter\-mode structure and realistic intra\-mode noise, enabling a controlled evaluation of mode representation, selection, and adaptation\. A summary of the number of modes, their geometric definitions, and the number of demonstrations collected per mode for each task is provided in Table [7](https://arxiv.org/html/2606.00336#A2.T7)\.

Table 7: Multimodal dataset construction\. For each task, we explicitly define discrete execution modes to induce inter\-mode diversity, and collect multiple noisy demonstration variants within each mode to capture intra\-mode variability\.

#### B\.1\.2 Real Robot Setup

In addition to simulation experiments, we evaluate PDP on a real\-world manipulation task using a Franka Emika Panda robot arm\. The task is OpenDrawer, which requires the robot to reach a drawer handle, establish contact, and pull the drawer open to a target configuration\. The physical setup mirrors the simulation task at a high level but introduces substantially greater execution variability due to sensing noise, unmodeled dynamics, and human teleoperation\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x7.png)Figure 7: Visualization of multimodal demonstration datasets for three representative tasks\. First column: Real\-robot OpenDrawer, showing the physical scene \(top\) and collected end\-effector \(EE\) trajectories \(bottom\) from six distinct reaching\-and\-pulling modes\. Trajectories of the same color correspond to noisy intra\-mode demonstrations collected via SpaceMouse teleoperation\. Second column: Simulated CloseDrawer, where the bottom panel shows EE trajectories corresponding to different obstacle\-circumventing reaching strategies around the drawer handle\. Third column: Simulated PickUpCup, visualizing EE trajectories associated with different grasping strategies as the robot approaches the cup\. Fourth column: A top\-down visualization of grasping locations for PickUpCup, illustrating the four discrete grasping styles defined at angular positions 0°, 90°, 180°, and 270° around the cup rim\. Each grasping point corresponds to a distinct execution mode\. Across all tasks, different colors denote distinct modes, while variations within a color reflect intra\-mode execution noise\.

To induce structured multimodality in the real\-robot setting, we explicitly design six distinct execution modes for reaching and pulling the drawer handle\. These modes differ in how the end\-effector approaches the handle prior to contact\. Specifically, the robot may approach from the left or right side of the handle, with two variants for each side corresponding to trajectories that remain closer to or farther from the straight start–end line\. In addition, we include two vertical approach modes, where the end\-effector reaches the handle from a higher or lower elevation relative to the handle center\. Each mode therefore corresponds to a distinct geometric strategy for handle acquisition and pulling\.

For each mode, we collect multiple noisy intra\-mode demonstration trajectories via human teleoperation using a SpaceMouse\. In total, we record 15 demonstrations per mode\. Compared to simulation, these demonstrations exhibit substantially higher variability due to teleoperation imprecision and physical interaction effects, resulting in noticeably noisier end\-effector trajectories even within the same mode\. This increased intra\-mode noise makes the real\-robot dataset a particularly challenging testbed for behavior representation and mode conditioning\.

#### B\.1\.3 Visualization of True Multimodal Dataset Collection

Figure [7](https://arxiv.org/html/2606.00336#A2.F7) depicts representative multimodal demonstration data from three tasks: real\-robot OpenDrawer, and simulated CloseDrawer and PickUpCup\. For each task, we show the task scene \(top row\) together with the corresponding end\-effector \(EE\) trajectories from the collected demonstrations \(bottom row\)\. Different colors indicate distinct execution modes, while trajectories within the same color correspond to noisy intra\-mode variants\. These visualizations highlight the structured inter\-mode diversity induced by our task design, as well as the substantial intra\-mode variability present in the demonstrations\. In particular, the real\-robot OpenDrawer trajectories exhibit significantly higher noise and dispersion compared to simulation, reflecting teleoperation imprecision and real\-world contact dynamics\. Together, these examples illustrate that our datasets capture true multimodality beyond simple stochastic perturbations, providing a challenging benchmark for behavior representation, mode selection, and adaptation\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x8.png)Figure 8: Examples of constraint\-induced behavior shifts for two representative tasks under four evaluation variants\. Top row: CloseDrawer, showing EE trajectories for the original training scene and three constraint\-shifted scenes\. Middle row: PickUpCup EE trajectories, illustrating different approach behaviors toward the cup under the same four scene variants\. Bottom row: PickUpCup grasp\-point visualizations, showing the grasping locations on the cup rim that define the four training modes\. Columns correspond to \(from left to right\): Original Scene, Constraint Shift I \(Scene 1\), Constraint Shift II \(Scene 2\), and Zero\-Mode Feasible \(Scene 3\)\. Solid trajectories indicate execution modes covered by the training dataset \(eight for CloseDrawer, four for PickUpCup\)\. Red trajectories indicate failure due to collision with obstacles or invalid grasping, causing early termination\. The dotted trajectory denotes a newly provided demonstration that is not present in the training data and is required to solve the task in the zero\-mode\-feasible setting\.

### B\.2 Additional Results

Due to space constraints, the main text \(Table [2](https://arxiv.org/html/2606.00336#S5.T2)\) reports constraint\-shifted results on four representative domains\. Table [8](https://arxiv.org/html/2606.00336#A2.T8) provides the complete results across all eight manipulation domains and all nine methods, under the same evaluation protocol described in Appendix [B\.3](https://arxiv.org/html/2606.00336#A2.SS3)\.

Table 8: Full success rate \(%\) on constraint\-shifted scenes across all eight manipulation domains and nine methods \(mean ± std over 3 seeds\)\. Scenes 1 and 2 invalidate all except a single training mode; Scene 3 is the zero\-mode\-feasible setting, where all training modes are invalidated and a single new demonstration is provided for adaptation\. The best performance in each row is shown in bold\. This table extends the four\-domain summary in Table [2](https://arxiv.org/html/2606.00336#S5.T2)\.

Table [8](https://arxiv.org/html/2606.00336#A2.T8) reports the complete constraint\-shifted results across all eight domains and nine methods\. PDP attains the highest overall average success rate \(95\.0%\), exceeding the strongest baseline by more than 50 points, and is the only method that remains robust across every domain and every scene type\. Consistent with the main text, several baselines are competitive on individual domains \(VQ\-BeT on CloseDrawer, BC\-GMM on OpenDoor, and Diff\-ES on the Avoiding tasks\), but none generalizes across the full benchmark\. In particular, every baseline degrades sharply in the zero\-mode\-feasible setting \(Scene 3\), where success requires synthesizing a behavior absent from the training data, whereas PDP sustains high success by fitting its geometry\-aligned latent to the single provided demonstration\. These full results reinforce the main\-text conclusion that an explicit, behavior\-aligned latent offers a more reliable adaptation interface\.

### B\.3 Evaluation Variants and Constraint Shifts

We evaluate PDP and all baseline methods under a set of controlled environment variants designed to isolate behavior\-side adaptation under constraint\-induced shifts\. Each task is tested under four conditions: the Original Scene, which matches the training environment and serves as a baseline for multimodal imitation fidelity; Scene 1 and Scene 2, which introduce constraints that invalidate a subset of the demonstrated strategies while leaving exactly one training mode feasible; and Scene 3, a zero\-mode\-feasible setting in which all training modes fail and success requires discovering a qualitatively new trajectory guided by a single new demonstration\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x9.png)Figure 9: Real\-robot constraint\-induced behavior shift for OpenDrawer\. To simulate realistic deployment conditions, everyday objects such as cups, books, and snack containers are placed between the robot and the drawer, blocking previously demonstrated reaching strategies\. These physical constraints invalidate training modes and require the policy to adapt its approach trajectory under real\-world noise and contact dynamics\.

Figure [8](https://arxiv.org/html/2606.00336#A2.F8) illustrates these variants using two representative tasks: CloseDrawer and PickUpCup\. Solid trajectories denote execution modes covered by the training dataset \(eight for CloseDrawer and four for PickUpCup\)\. In constraint\-shifted scenes, trajectories shown in red indicate failure due to collision with obstacles or invalid grasping, causing the episode to terminate early\. The dotted trajectory corresponds to a newly provided demonstration that lies outside the training distribution and is required to solve the task in the zero\-mode\-feasible setting\. We include these examples for illustration; the remaining tasks follow analogous constraint constructions\. For the two pick\-and\-place tasks \(PlaceBlock and MeatOffGrill\), walls are inserted both between the initial gripper position and the object and between the object and the target, yielding constraint patterns similar to CloseDrawer\. For the real\-robot OpenDrawer task, physical objects such as cups, books, and snack containers are placed between the robot and the drawer to create realistic constraint\-induced behavior shifts, as shown in Figure [9](https://arxiv.org/html/2606.00336#A2.F9)\.

### B\.4 Policy Training and Baseline Configurations

We evaluate Parameterized Diffusion Policy \(PDP\) and a set of representative behavior cloning baselines under a unified training and evaluation protocol\. All methods share the same backbone architecture and observation processing pipeline to ensure a fair comparison; differences arise only from their learning objectives, conditioning mechanisms, and test\-time adaptation strategies\.

PDP is trained using the joint optimization scheme described in Algorithm [2](https://arxiv.org/html/2606.00336#alg2), alternating between embedding refinement via the geometry\-aligned objective $\mathcal{L}_{embed}$ and diffusion denoiser optimization via $\mathcal{L}_{DP}$\. The denoiser is conditioned on the behavior latent $z$ through global modulation, enabling deep and explicit integration of the control signal\. At evaluation time in the original \(non\-shifted\) environments, we compute a representative latent for each mode by averaging the latent codes of demonstrations belonging to that mode, and execute the policy conditioned on this mean latent\. For each task, we perform a total of 40 rollouts, distributed evenly across modes\.
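
The evaluation procedure above (one mean latent per mode, 40 rollouts split evenly) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `mode_mean_latents` and `rollouts_per_mode` and the toy latents are hypothetical, and the latent codes are assumed to already be available as an array with one row per demonstration.

```python
import numpy as np

def mode_mean_latents(latents, mode_ids):
    """Average the latent codes of all demonstrations that share a mode label."""
    latents = np.asarray(latents, dtype=float)
    mode_ids = np.asarray(mode_ids)
    return {int(m): latents[mode_ids == m].mean(axis=0) for m in np.unique(mode_ids)}

def rollouts_per_mode(total_rollouts, num_modes):
    """Split a rollout budget as evenly as possible across modes."""
    base, extra = divmod(total_rollouts, num_modes)
    return [base + (1 if i < extra else 0) for i in range(num_modes)]

# Toy data (assumed): 2-D latents for two modes, three demonstrations each.
latents = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.2],
           [2.0, -1.0], [2.2, -1.0], [1.8, -1.0]]
mode_ids = [0, 0, 0, 1, 1, 1]

means = mode_mean_latents(latents, mode_ids)   # one representative latent per mode
budget = rollouts_per_mode(40, 8)              # e.g. eight modes, 40 rollouts in total
```

When the budget does not divide evenly (for example six modes and 40 rollouts), this sketch gives the first few modes one extra rollout; the paper does not say how such remainders are handled.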

For long\-horizon tasks \(PlaceBlock and MeatOffGrill\), demonstrations are segmented into reaching and carrying phases based on gripper open/close events\. Each phase is normalized by its start and end positions prior to embedding, so that the behavior latent captures trajectory shape rather than absolute location\. During evaluation, the same latent $z$ is used to condition both phases, resulting in eight effective latents for these tasks despite their two\-stage structure\.
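
A rough sketch of the segmentation and normalization step is given below. The exact normalization is not specified in this section, so the code assumes one plausible choice: translate each phase so it starts at the origin and divide by the start-to-end distance. It also assumes the demonstration ends at release, so only the first gripper-close event is needed. `split_phases` and `normalize_phase` are hypothetical helper names.

```python
import numpy as np

def split_phases(positions, gripper_closed):
    """Split a demo at the first gripper-close event into reaching and carrying phases."""
    positions = np.asarray(positions, dtype=float)
    gripper_closed = np.asarray(gripper_closed, dtype=bool)
    if not gripper_closed.any():
        raise ValueError("demonstration never closes the gripper")
    close_idx = int(np.argmax(gripper_closed))  # index of the first True entry
    # The grasp waypoint is kept in both phases so each has a well-defined endpoint.
    return positions[: close_idx + 1], positions[close_idx:]

def normalize_phase(phase, eps=1e-8):
    """Assumed normalization: move the start to the origin, scale by start-to-end distance."""
    phase = np.asarray(phase, dtype=float)
    start, end = phase[0], phase[-1]
    scale = max(np.linalg.norm(end - start), eps)
    return (phase - start) / scale

# Toy 2-D end-effector path: reach along x, then carry along y.
positions = [[0, 0], [1, 0], [2, 0], [2, 1], [2, 3]]
gripper_closed = [False, False, True, True, True]

reach, carry = split_phases(positions, gripper_closed)
reach_n, carry_n = normalize_phase(reach), normalize_phase(carry)
```

Under this assumed normalization, two phases with the same shape but different start points or lengths map to the same normalized curve, which is the property the text asks for.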

We compare PDP against four baseline paradigms: standard Diffusion Policy \(DP\), vanilla Behavior Cloning \(BC\), Behavior Cloning with Gaussian Mixture Models \(BC\-GMM\), and Implicit Behavioral Cloning \(IBC\)\. All baselines are trained on the same datasets and evaluated using 40 rollouts per task in both original and constraint\-shifted environments\. Baseline\-specific adaptation and inference procedures follow standard practice and are summarized in Table [9](https://arxiv.org/html/2606.00336#A2.T9)\.

Table 9: Comparison of training objectives, test\-time adaptation mechanisms, and evaluation protocols across PDP and baseline methods\. All methods use the same backbone architecture and observation encoder for fair comparison\.

| Hyperparameter | Value |
| --- | --- |
| Diffusion steps $K$ | 20 |
| Behavior latent dimension $d_z$ | 2 |
| Embedding learning rate | $1\times 10^{-4}$ |
| Denoiser learning rate | $1\times 10^{-4}$ |
| Batch size | 256 |
| Global Modulation hidden dimension | \[16, 32\] |
| $\beta_{\mathrm{KL}}$ \(VAE\) | $1\times 10^{-3}$ |
| $\beta_{\mathrm{geo}}$ \(geometry loss\) | 1\.0 |
| Soft\-DTW smoothing $\gamma$ | $1\times 10^{-5}$ |

Table 10: Training hyperparameters used for PDP and baseline methods\. Unless otherwise specified, the same architectural and optimization settings are shared across methods\.
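
To give a feel for the Soft-DTW smoothing value listed among the hyperparameters, the sketch below implements the standard soft-DTW recursion in NumPy. How the paper plugs this quantity into its geometry loss is not described in this section, so the code is only a generic illustration of the trajectory distance itself; `soft_min` and `soft_dtw` are names chosen here, and the squared Euclidean ground cost is an assumption.

```python
import numpy as np

def soft_min(values, gamma):
    """Smoothed minimum -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    lowest = values.min()
    return lowest - gamma * np.log(np.exp(-(values - lowest) / gamma).sum())

def soft_dtw(x, y, gamma):
    """Soft-DTW value between two trajectories of shape (length, dim)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Assumed ground cost: squared Euclidean distance between waypoints.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]

a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = a + np.array([0.0, 1.0])            # same path, shifted by one unit in y

near_hard = soft_dtw(a, b, gamma=1e-5)  # tiny gamma: close to classical DTW
smoothed = soft_dtw(a, b, gamma=1.0)    # larger gamma: smoother, lower value
```

With a smoothing value as small as $10^{-5}$, the soft minimum is nearly indistinguishable from a hard minimum, so the distance behaves almost like classical DTW while remaining differentiable.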

## Appendix C Extended Experimental Results

![Refer to caption](https://arxiv.org/html/2606.00336v1/x10.png)Figure 10: Diffusion training loss under different latent integration mechanisms\. Left: CloseDrawer\. Right: PickUpCup\. Global Modulation consistently converges to a substantially lower loss than Concatenation and Unconditioned variants, indicating reduced inter\-mode ambiguity during training\.

### C\.1 Training Dynamics of Latent Integration Mechanisms

To better understand why deep latent integration improves adaptation performance, we analyze the diffusion training dynamics under different latent integration mechanisms\. Figure [10](https://arxiv.org/html/2606.00336#A3.F10) plots the diffusion noise\-prediction loss throughout training for three variants: \(i\) *Global Modulation*, which integrates the behavior latent via affine modulation of intermediate feature maps, \(ii\) *Concatenation*, which appends the latent $z$ to the denoiser input, and \(iii\) *Unconditioned*, a standard diffusion policy without a behavior latent\.
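
The difference between the two conditioned variants can be sketched in a few lines of NumPy. This is a schematic under stated assumptions, not the paper's architecture: the scale and shift are produced by single linear maps (`w_scale`, `w_shift`, both hypothetical), the modulation uses the common `(1 + scale) * h + shift` form, and all tensor sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def concat_condition(x, z):
    """Concatenation: append the latent z to the flattened denoiser input."""
    return np.concatenate([x, z], axis=-1)

def global_modulation(h, z, w_scale, w_shift):
    """Affine modulation: per-channel scale and shift of a feature map, predicted from z."""
    scale = (z @ w_scale)[:, :, None]   # (batch, channels, 1), broadcast over time
    shift = (z @ w_shift)[:, :, None]
    return (1.0 + scale) * h + shift

# Assumed sizes: batch of 4, 8 channels, horizon 16, latent dimension 2.
batch, channels, horizon, d_z = 4, 8, 16, 2
h = rng.standard_normal((batch, channels, horizon))   # intermediate feature map
z = rng.standard_normal((batch, d_z))                  # behavior latent
w_scale = 0.1 * rng.standard_normal((d_z, channels))
w_shift = 0.1 * rng.standard_normal((d_z, channels))

modulated = global_modulation(h, z, w_scale, w_shift)
concatenated = concat_condition(h.reshape(batch, -1), z)
```

In the concatenated variant the latent enters once, as two extra input features among many; in the modulated variant it rescales every channel of the layer it is applied to, which is one way to read "deep" integration.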

Across both CloseDrawer and PickUpCup, Global Modulation consistently achieves a substantially lower training loss than the other two variants, nearly an order of magnitude lower at convergence\. In contrast, Concatenation and Unconditioned models converge to similar loss levels, indicating that shallow latent injection provides limited benefit for resolving mode ambiguity during training\.

This behavior directly supports our analysis in Appendix [A\.4](https://arxiv.org/html/2606.00336#A1.SS4)\. When the denoiser is weakly or not conditioned on behavior, it must implicitly average over multiple incompatible trajectory modes, resulting in a harder regression problem with higher irreducible variance\. Global Modulation collapses this inter\-mode ambiguity by explicitly reparameterizing the denoiser around a target behavior, converting diffusion from a global mixture\-modeling task into a mode\-specific denoising problem\. The resulting reduction in training loss reflects a genuine simplification of the learning objective rather than improved optimization alone, and explains why Global Modulation yields more stable and effective test\-time adaptation\.
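
The averaging argument can be illustrated with a toy regression problem. This is an analogy rather than a diffusion model: it assumes a one-dimensional target drawn from two modes centred at -1 and +1 with small within-mode noise, and compares the best constant predictor with the best per-mode predictor under squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

mode = rng.integers(0, 2, size=n)        # which strategy each sample follows
centers = np.array([-1.0, 1.0])          # two incompatible targets (assumed)
target = centers[mode] + 0.1 * rng.standard_normal(n)

# Unconditioned: the squared-error-optimal constant prediction is the overall mean,
# so the residual error includes the full spread between the two modes.
uncond_pred = target.mean()
uncond_mse = np.mean((target - uncond_pred) ** 2)

# Mode-conditioned: one prediction per mode, so only within-mode noise remains.
cond_pred = np.array([target[mode == m].mean() for m in (0, 1)])
cond_mse = np.mean((target - cond_pred[mode]) ** 2)
```

Here the unconditioned error is dominated by the gap between the modes and cannot be reduced by better optimization, whereas the conditioned error shrinks to the within-mode noise level. That is the sense in which a lower loss can reflect an easier objective.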

![Refer to caption](https://arxiv.org/html/2606.00336v1/x11.png)Figure 11: Simulation trajectory visualizations under constraint\-induced shifts \(CloseDrawer, PickUpCup\)\. Columns compare PDP against DP, BC\-GMM, BC, and IBC\. *Top row \(CloseDrawer\):* executed end\-effector trajectories; PDP remains mode\-consistent and concentrated along feasible obstacle\-circumventing corridors, while unconditioned baselines exhibit mode interference and collapse into infeasible regions\. *Middle row \(PickUpCup\):* executed end\-effector trajectories; *Bottom row \(PickUpCup\):* realized grasp points on the cup rim\. PDP commits to valid grasp affordances with low dispersion, whereas unconditioned baselines produce scattered \(often invalid\) grasp attempts, reflecting cross\-mode averaging and unstable steering\.

### C\.2 Comparative Trajectory Visualizations

Quantitative success rates summarize whether a rollout terminates in success, but they often under\-specify *how* that success is achieved and can obscure qualitatively different failure modes\. This is particularly important in our setting of *constraint\-induced behavior shifts*, where feasibility hinges on selecting \(or synthesizing\) a specific geometric strategy: two methods may have the same binary outcome on a subset of trials while exhibiting radically different levels of consistency, safety margin, and mode fidelity\. We therefore complement the success rates with trajectory\-level visualizations \(Figures [11](https://arxiv.org/html/2606.00336#A3.F11) and [12](https://arxiv.org/html/2606.00336#A3.F12)\) that directly reveal \(i\) whether a policy executes a coherent mode versus averaging across incompatible modes, and \(ii\) whether the induced motion remains within the feasible corridor under new constraints\.

##### Simulation: CloseDrawer \(row 1 in Fig\. [11](https://arxiv.org/html/2606.00336#A3.F11)\)\.

The CloseDrawer scene is multimodal because the end\-effector must route around an obstacle to reach the handle, yielding multiple distinct approach geometries\. In Fig\. [11](https://arxiv.org/html/2606.00336#A3.F11) \(top row\), PDP produces smooth, mode\-consistent trajectories that remain concentrated around a small set of feasible approach manifolds \(multiple colored rollouts indicate consistent reconstruction/selection of distinct modes\)\. In contrast, unconditioned DP exhibits pronounced *mode interference*: repeated rollouts spread widely and frequently drift into the obstacle region \(visualized by the dense, noisy trajectory bundle that “fills” the workspace rather than tracking a narrow approach corridor\)\. This is the qualitative signature of the ambiguity described in Appendix [A\.4](https://arxiv.org/html/2606.00336#A1.SS4): when the denoiser is not explicitly indexed by behavior, it must implicitly average denoising directions across incompatible strategies, producing unstable rollouts that collapse into infeasible regions\. BC\-GMM can partially preserve a small number of coarse modes, but the resulting trajectories still show substantial dispersion near the constraint boundary; BC and IBC further degenerate into highly scattered behavior, reflecting the lack of a reliable mechanism to *select* a single geometric strategy under ambiguity\.

##### Simulation: PickUpCup \(rows 2–3 in Fig\. [11](https://arxiv.org/html/2606.00336#A3.F11)\)\.

PickUpCup isolates a different kind of multimodality: the dominant discrete choice is the *grasp affordance* \(four valid grasp points around the rim\), while approach motions are conditioned on that choice\. Fig\. [11](https://arxiv.org/html/2606.00336#A3.F11) makes this distinction explicit by pairing trajectory rollouts \(middle row\) with the realized grasp points \(bottom row\)\. PDP concentrates its grasp selections tightly around the valid affordances, indicating that the learned latent provides a stable control handle for committing to a single grasp mode and executing it consistently\. Unconditioned DP, despite its expressive generative capacity, shows “averaging” behavior in grasp selection: realized grasp points disperse broadly along the rim and into invalid regions, matching the noisy and inconsistent approach trajectories\. BC\-GMM can sometimes select a plausible component but remains brittle under shifts; BC and IBC scatter across many invalid grasp attempts, indicating that small errors in action prediction translate into qualitatively wrong grasp commitment\. The key takeaway is that the advantage of PDP here is not merely smoother motion, but *mode commitment*: the latent\-conditioned denoiser resolves the discrete ambiguity early and then performs within\-mode denoising, preventing cross\-mode averaging from manifesting as invalid grasp selection\.

##### Real robot: OpenDrawer \(Fig\. [12](https://arxiv.org/html/2606.00336#A3.F12)\)\.

On hardware, execution noise, contact variability, and perception imperfections amplify the cost of ill\-conditioned steering: a method that only sporadically finds a correct strategy may still be unusable if its rollouts jitter across modes or repeatedly skim constraints\. Fig\. [12](https://arxiv.org/html/2606.00336#A3.F12) shows that PDP maintains coherent, repeatable approach\-and\-pull trajectories across trials, despite real\-world stochasticity, whereas DP exhibits substantially higher dispersion and inconsistent approach geometry\. This qualitative difference matches the real\-robot success\-rate gap \(Table [3](https://arxiv.org/html/2606.00336#S5.T3)\): PDP’s behavior latent acts as a low\-dimensional, geometry\-aligned interface that stabilizes test\-time steering, while noise\-space control in DP remains sensitive and produces high\-variance rollouts under constraint changes\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x12.png)Figure 12: Real\-robot OpenDrawer trajectory visualizations under constraint\-induced shifts\. PDP produces repeatable, coherent approach\-and\-pull trajectories across trials despite hardware noise, while DP exhibits substantially larger dispersion and inconsistent approach geometry, illustrating the instability of noise\-space steering under real\-world constraints\.

##### Overall takeaway\.

Across simulation and hardware, these visualizations corroborate the central claim of the paper: PDP converts diffusion from noise\-driven diversity into a controllable mechanism by exposing a geometry\-aligned behavior coordinate\. Trajectory plots reveal*how*this translates into practice—stable mode selection, reduced cross\-mode averaging, and safer constraint\-aware execution—properties that are difficult to diagnose from binary success rates alone\.

### C\.3 Latent Space Navigation and Interpolation

In this section, we analyze how the learned behavior manifold supports controlled interpolation across qualitatively different task semantics \(Fig\. [13](https://arxiv.org/html/2606.00336#A3.F13)–[15](https://arxiv.org/html/2606.00336#A3.F15)\)\. These plots are designed to probe not whether PDP can reconstruct training modes, but whether the geometry induced by $\mathcal{L}_{\mathrm{geo}}$ yields a *navigable* space in which intermediate latents correspond to coherent, physically meaningful behaviors\.

In CloseDrawer \(Fig\. [13](https://arxiv.org/html/2606.00336#A3.F13)\), the latent space encodes reaching styles around an obstacle\. We evaluate multiple interpolation paths, including linear and curved traversals, between distant mode clusters\. Despite traversing different trajectories in latent space, all interpolations produce smooth, collision\-free end\-effector paths that continuously deform the approach geometry\. This demonstrates that PDP does not merely interpolate between memorized trajectories, but instead induces a continuous family of feasible reaching strategies parameterized by $z$\.
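
The two kinds of latent traversal can be generated as below for a 2-D latent space. This is an illustrative sketch only: `linear_path` and `elliptical_path` are hypothetical helpers, the `bulge` parameter (semi-minor axis as a fraction of half the chord length) is an assumption, and the paper does not specify how its curved paths are parameterized.

```python
import numpy as np

def linear_path(z_a, z_b, num=9):
    """Evenly spaced latents on the straight segment from z_a to z_b."""
    z_a, z_b = np.asarray(z_a, dtype=float), np.asarray(z_b, dtype=float)
    t = np.linspace(0.0, 1.0, num)[:, None]
    return (1.0 - t) * z_a + t * z_b

def elliptical_path(z_a, z_b, bulge=0.5, num=9):
    """Half-ellipse from z_a to z_b in a 2-D latent space."""
    z_a, z_b = np.asarray(z_a, dtype=float), np.asarray(z_b, dtype=float)
    center = 0.5 * (z_a + z_b)
    half = 0.5 * (z_b - z_a)                  # semi-major axis along the chord
    normal = np.array([-half[1], half[0]])    # perpendicular, same length as `half`
    theta = np.linspace(np.pi, 0.0, num)[:, None]
    return center + np.cos(theta) * half + bulge * np.sin(theta) * normal

# Assumed cluster centres for two distant modes.
z_a, z_b = np.array([0.0, 0.0]), np.array([2.0, 0.0])
straight = linear_path(z_a, z_b)
curved = elliptical_path(z_a, z_b, bulge=0.5)
```

Each row of `straight` or `curved` would then be passed to the policy as the conditioning latent for one rollout; both paths share endpoints but visit different intermediate latents.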

In contrast, PickUpCup exhibits a different semantic structure: the primary source of multimodality lies in discrete grasp affordances rather than approach paths\. As shown in Fig\. [14](https://arxiv.org/html/2606.00336#A3.F14), interpolating $z$ produces a smooth progression of grasp locations along the rim of the cup\. This confirms that the latent space does not collapse all tasks into a single notion of trajectory similarity; instead, it adapts to the task\-specific factors of variation encoded by the demonstrations\.

Finally, Fig\. [15](https://arxiv.org/html/2606.00336#A3.F15) replicates the same interpolation procedure on a real robot, where demonstrations exhibit substantial execution noise and unmodeled dynamics\. Despite this, interpolated latents yield consistent qualitative changes in end\-effector trajectories, indicating that the learned manifold remains coherent beyond simulation\. Together, these results support our claim that PDP learns a geometry\-aligned behavior space in which interpolation corresponds to semantically meaningful generalization, rather than arbitrary mixtures of training behaviors\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x13.png)Figure 13: Latent space interpolation on CloseDrawer \(simulation\)\. Top: learned latent distribution with clusters corresponding to distinct reaching strategies\. Middle: multiple interpolation paths in latent space, including linear and elliptical traversals between clusters\. Bottom: executed end\-effector trajectories produced by conditioning PDP on interpolated latents\. Despite differing interpolation geometries in $z$, the resulting behaviors exhibit smooth, physically plausible transitions in approach style and obstacle avoidance, confirming that the latent manifold is continuous and navigable\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x14.png)Figure 14: Latent interpolation for grasp selection on PickUpCup\. Left: latent distribution with clusters corresponding to discrete grasp affordances\. Middle: interpolation between grasp\-related latents\. Right: evaluated grasp points on the cup rim\. Unlike reaching\-dominated tasks, interpolation here induces a smooth shift in grasp location along the cup edge, demonstrating that the latent space captures task\-specific semantics beyond trajectory shape alone\.

![Refer to caption](https://arxiv.org/html/2606.00336v1/x15.png)Figure 15: Latent space interpolation on a real robot\. Left: demonstration trajectories collected on hardware\. Middle: interpolated latents in the learned behavior space\. Right: executed end\-effector trajectories under latent interpolation\. Despite substantial execution noise and unmodeled dynamics, interpolated latents yield consistent and smoothly varying behaviors, indicating that the learned manifold generalizes beyond simulation\.
