WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

arXiv cs.LG Papers

Summary

Introduces WarmPrior, a method that replaces the standard Gaussian source in flow-matching policies with a temporally grounded prior from recent action history, consistently improving success rates on robotic manipulation tasks by producing straighter probability paths.

arXiv:2605.13959v1 Announce Type: new Abstract: Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.

# WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
Source: [https://arxiv.org/html/2605.13959](https://arxiv.org/html/2605.13959)
###### Abstract

Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with *WarmPrior*, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly *straighter* probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, *WarmPrior* also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the *source distribution* as an important and underexplored design axis in generative robot control. Project page: [https://sinnnj.github.io/WarmPrior/](https://sinnnj.github.io/WarmPrior/).

## 1 Introduction

Learning generative policies for robotic manipulation, such as diffusion policies and flow-matching policies, has become a dominant paradigm for multi-modal behavior cloning Chi et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib8)); Bjorck et al. ([2025b](https://arxiv.org/html/2605.13959#bib.bib40)); Black et al. ([2025a](https://arxiv.org/html/2605.13959#bib.bib14)). In these frameworks, a neural field transports samples from a fixed source distribution to the data manifold of action chunks. Almost universally, this source distribution is the isotropic Gaussian $\mathcal{N}(0,I)$, a convention inherited from diffusion’s denoising-from-noise interpretation Ho et al. ([2020](https://arxiv.org/html/2605.13959#bib.bib26)); Song et al. ([2021](https://arxiv.org/html/2605.13959#bib.bib27)) and preserved by flow matching Braun et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib16)); Hu et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib17)) and its few-step policy descendants Prasad et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib35)); Lu et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib36)); Wang et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib37)). Progress has instead been pushed through the network, the interpolant, and the integrator, while the *prior space* has been quietly left untouched. Yet as denoising schedules shorten, the starting point absorbs more of the burden that integration steps once carried. A stateless, uninformative source remains blind to the continuous, temporally correlated nature of robotic motion, forcing the policy to rebuild every action chunk from scratch.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x1.png)

Figure 1: WarmPrior. Standard flow-matching policies transport samples from a context-free $\mathcal{N}(0,I)$ to the action manifold (left). WarmPrior initializes the transport from a temporally grounded Gaussian centered on the recent past-action chunk (Past) or on the model’s own previous forecast of the current chunk (Preview) (middle, right). The resulting probability path is shorter, straighter, and temporally correlated across consecutive chunks.

We introduce WarmPrior, which replaces this stateless source with a *temporally grounded* prior whose mean is anchored on recent action history ([Figure 1](https://arxiv.org/html/2605.13959#S1.F1)). We instantiate it in two minimal variants: *WP-Past* anchors the prior on the previously executed action chunk, while *WP-Preview* trains the policy to predict twice the chunk length at each inference step and reuses the model’s own previous forecast of the current chunk as the prior mean. Both add a residual Gaussian perturbation $\sigma\,\varepsilon$ so that the source remains a proper distribution, and both leave the network, the interpolant, and the integrator untouched ([Section 3](https://arxiv.org/html/2605.13959#S3)).

This deliberately minimal intervention yields gains that compound along three independent axes. *Geometrically*, starting close to the target manifold shortens the transport and straightens the learned probability paths, acting as an implicit optimal-transport coupling that suppresses the irreducible endpoint ambiguity the network would otherwise average over ([Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1)). *Temporally*, the residual scale $\sigma$ becomes a continuous knob between within-rollout commitment and multimodal expressiveness, supplying an implicit form of the consistency that action chunking enforces explicitly, and largely recovering baseline performance even when chunking is disabled ([Section 5.2](https://arxiv.org/html/2605.13959#S5.SS2)). *Downstream*, WarmPrior recenters and shrinks the search space of prior-space reinforcement learning around a temporally grounded mean, so a tighter residual action on top of a pretrained policy outperforms vanilla DSRL Wagenmaker et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib12)) in both sample efficiency and asymptotic performance ([Section 5.3](https://arxiv.org/html/2605.13959#S5.SS3)).

Empirically, on Robomimic, MimicGen, and a real Franka Research 3 setup, WarmPrior consistently improves success rate over the $\mathcal{N}(0,I)$ baseline with both the Diffusion Policy backbone Chi et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib8)) and the VLA model GR00T N1.5 Bjorck et al. ([2025a](https://arxiv.org/html/2605.13959#bib.bib15)); the improvement is largest at the lowest inference budgets and on the harder tasks, where the curvature of the flow matters most ([Section 4](https://arxiv.org/html/2605.13959#S4)). Taken together, these results promote the *source distribution* from an inherited default to a first-class, and previously underexplored, design axis in generative robotic control.

## 2 Background and Related Work

#### Flow-matching policies.

Flow matching Lipman et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib3)); Albergo et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib4)) trains a velocity network $v_\theta(t, a_t, o)$ along the linear interpolant $a_t = (1-t)\,a_0 + t\,a_1$ between a source $a_0 \sim p_0$ and data $a_1 \sim p_{\mathrm{data}}(\cdot \mid o)$, and samples by integrating $\dot{a}_t = v_\theta(t, a_t, o)$ from $a_0$. This paradigm underlies diffusion and flow-matching policies for behavior cloning Chi et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib8)); Janner et al. ([2022](https://arxiv.org/html/2605.13959#bib.bib9)); Braun et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib16)); Hu et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib17)); Chisari et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib38)) and vision-language-action models Bjorck et al. ([2025b](https://arxiv.org/html/2605.13959#bib.bib40)); Black et al. ([2025a](https://arxiv.org/html/2605.13959#bib.bib14)); Physical Intelligence et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib23)). Nearly all of them use $p_0 = \mathcal{N}(0,I)$; our work revisits that choice.

#### Optimal-transport couplings and straightened flows.

Under the independent coupling $(a_0, a_1) \sim p_0 \otimes p_{\mathrm{data}}$, crossing trajectories force the velocity network to average over ambiguous endpoints, producing curved paths. Rectified Flow Liu et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib5)), Multisample Flow Matching Pooladian et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib7)), OT-CFM Tong et al. ([2024a](https://arxiv.org/html/2605.13959#bib.bib6)), and Schrödinger-bridge variants Shi et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib18)); Tong et al. ([2024b](https://arxiv.org/html/2605.13959#bib.bib19)) all reshape this *coupling* to approximate dynamic OT. WarmPrior is complementary: it leaves the coupling independent and instead reshapes the *source distribution* so the flow begins already close to data, straightening paths ([Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1)) without any OT solver or retraining stage.

#### Informed priors for generative robot policies.

Modifying the source of a generative policy is a small but emerging direction. BRIDGER Chen et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib20)) replaces the Gaussian source with a data-aware, non-Gaussian source policy and bridges it to the expert distribution via stochastic interpolants. In concurrent work, STEP Li et al. ([2026](https://arxiv.org/html/2605.13959#bib.bib21)) trains an auxiliary action predictor whose output, perturbed by scheduled Gaussian noise, is injected at an *intermediate* denoising step rather than at $t=0$, so the warm start lives inside the diffusion trajectory. A2A Jia et al. ([2026](https://arxiv.org/html/2605.13959#bib.bib22)) also anchors the prior on past actions, but encodes them deterministically into a latent source and composes a deterministic ODE and decoder on top, making it effectively a history-conditioned *deterministic flow transport model* rather than a stochastic generative sampler. In contrast, WarmPrior preserves the stochastic flow-matching formulation end-to-end and focuses squarely on how to construct the *prior space* $p_0$ itself ([Section 3](https://arxiv.org/html/2605.13959#S3)).

**Algorithm 1** Training and inference of FM policy with WarmPrior.

1. **Input:** dataset $\mathcal{D}$, interpolant $(\alpha, \beta)$, noise scale $\sigma$, chunk length $H$ (prediction length is $H$ for Past, $2H$ for Preview)
2. **Parameters:** velocity net $v_\theta$ (learnable)

**Training**

3. **for** each iteration **do**
4. Sample $(o, a_1, i) \sim \mathcal{D}$
5. Draw $\varepsilon \sim \mathcal{N}(0, I)$ matching $a_1$; set $a_0 \leftarrow \varepsilon$
6. **if** Past **then**
7. $a_0 \leftarrow a^{\mathrm{data}}[i-H{:}i] + \sigma\,\varepsilon$ (when $i \geq H$)
8. **else if** Preview **then**
9. $a_0[0{:}H] \leftarrow a_1[0{:}H] + \sigma\,\varepsilon[0{:}H]$
10. **end if**
11. $t \sim \mathcal{U}(0, 1)$
12. $a_t \leftarrow \alpha(t)\,a_0 + \beta(t)\,a_1$
13. $\mathcal{L} \leftarrow \| v_\theta(t, a_t, o) - (\dot{\alpha}\,a_0 + \dot{\beta}\,a_1) \|_2^2$
14. Gradient step on $\theta$
15. **end for**

**Inference**

16. $\hat{a}^{\mathrm{prev}} \leftarrow \varnothing$; reset env, observe $o$
17. **while** episode not done **do**
18. Draw $\varepsilon \sim \mathcal{N}(0, I)$; set $a_0 \leftarrow \varepsilon$
19. **if** Past **and** $\hat{a}^{\mathrm{prev}} \neq \varnothing$ **then**
20. $a_0 \leftarrow \hat{a}^{\mathrm{prev}} + \sigma\,\varepsilon$
21. **else if** Preview **and** $\hat{a}^{\mathrm{prev}} \neq \varnothing$ **then**
22. $a_0[0{:}H] \leftarrow \hat{a}^{\mathrm{prev}}[H{:}2H] + \sigma\,\varepsilon[0{:}H]$
23. **end if**
24. $\hat{a} \leftarrow \mathrm{FMSample}(v_\theta, a_0, o)$
25. Execute $\hat{a}[0{:}H]$; observe next $o$
26. $\hat{a}^{\mathrm{prev}} \leftarrow \hat{a}$
27. **end while**
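The inference half of Algorithm 1 (lines 16 to 27) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors’ implementation: `fm_sample`, `rollout`, the toy `velocity` field, the fixed `target`, and the action dimension `d_a = 7` are hypothetical names and values introduced here, the sampler is assumed to be plain Euler integration, and the environment step is replaced by reusing the same observation.

```python
import numpy as np

def fm_sample(velocity, a0, obs, nfe=4):
    """Euler-integrate da/dt = velocity(t, a, obs) from t=0 to t=1 in nfe steps."""
    a, dt = a0.copy(), 1.0 / nfe
    for k in range(nfe):
        a = a + dt * velocity(k * dt, a, obs)
    return a

def rollout(velocity, obs, H, d_a, sigma, variant, n_chunks, rng, nfe=4):
    """Inference loop of Algorithm 1 without a real environment (assumption)."""
    length = H if variant == "past" else 2 * H
    a_prev, executed = None, []
    for _ in range(n_chunks):
        eps = rng.standard_normal((length, d_a))
        a0 = eps.copy()                                  # cold start: plain N(0, I)
        if a_prev is not None and variant == "past":
            a0 = a_prev + sigma * eps                    # Algorithm 1, line 20
        elif a_prev is not None and variant == "preview":
            a0[:H] = a_prev[H:2 * H] + sigma * eps[:H]   # Algorithm 1, line 22
        a_hat = fm_sample(velocity, a0, obs, nfe)
        executed.append(a_hat[:H])                       # execute the first H actions
        a_prev = a_hat                                   # keep the full prediction
    return np.concatenate(executed)

# Toy stand-in for a trained network: the exact straight-line field to a fixed target.
H, d_a = 8, 7
target = np.zeros((2 * H, d_a))
velocity = lambda t, a, obs: (target[: a.shape[0]] - a) / (1.0 - t)
actions = rollout(velocity, None, H, d_a, 1.0, "preview", 3, np.random.default_rng(0))
```

Only the construction of `a0` differs between the baseline and the two variants; the sampler and the velocity callable are shared, which mirrors the claim that WarmPrior leaves the network and integrator untouched.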

## 3 WarmPrior

WarmPrior modifies only the source distribution of a flow-matching policy: it reshapes $p_0$ while leaving the network, interpolant, and training objective untouched. We instantiate it as two minimal variants, *WarmPrior-Past* (WP-Past) and *WarmPrior-Preview* (WP-Preview), which differ only in how the prior mean is anchored to the agent’s own action history. Below we formalize the common template ([Algorithm 1](https://arxiv.org/html/2605.13959#alg1)) and then specify each variant in turn.

#### Formulation.

Let $a_0$ denote the sample drawn from the prior that the flow-matching ODE transports into the predicted action chunk, with shape $H \times d_a$ for Past and $2H \times d_a$ for Preview. For a warm index set $\mathcal{W}$ over the prediction positions, with cold complement $\mathcal{C}$, and a mean $\mu$ defined on $\mathcal{W}$, WarmPrior samples

$$a_0[\tau] \;=\; \begin{cases} \mu_\tau + \sigma\,\varepsilon_\tau, & \tau \in \mathcal{W}, \\ \varepsilon_\tau, & \tau \in \mathcal{C}, \end{cases} \qquad \varepsilon \sim \mathcal{N}(0, I). \tag{1}$$

The cold region keeps the vanilla flow-matching prior intact, so positions without a reliable anchor behave exactly as in the standard flow-matching baseline. The scalar $\sigma > 0$ controls the residual noise on warm positions so that the warm region remains a proper distribution rather than a deterministic point mass; we fix $\sigma$ per variant below and revisit it as a multimodality knob in [Section 5.2](https://arxiv.org/html/2605.13959#S5.SS2). Under this formulation, WarmPrior is fully specified by the pair $(\mathcal{W}, \mu)$ together with the prediction length. Our primary goal is to start the generative flow from a *plausible target action* rather than pure noise, and we propose two variants that differ in how the prior mean $\mu$ is anchored.
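Eq. (1) amounts to a masked Gaussian draw. The sketch below is one way to write it in NumPy; the function name `sample_warm_prior`, the boolean-mask encoding of $\mathcal{W}$, and the placeholder values for `mu` and `d_a` are assumptions made for illustration.

```python
import numpy as np

def sample_warm_prior(mu, warm_mask, sigma, rng):
    """Draw a0 per Eq. (1): mu + sigma*eps on warm rows, eps on cold rows.

    mu        : (L, d_a) array; rows outside the warm set are ignored.
    warm_mask : (L,) boolean array, True for positions in W.
    """
    eps = rng.standard_normal(mu.shape)
    warm = warm_mask[:, None]                  # broadcast over action dimensions
    return np.where(warm, mu + sigma * eps, eps)

rng = np.random.default_rng(0)
H, d_a = 8, 7                                  # d_a = 7 is an illustrative choice
mu = np.full((2 * H, d_a), 3.0)                # placeholder prior mean
warm_mask = np.arange(2 * H) < H               # Preview layout: W = {0..H-1}
a0 = sample_warm_prior(mu, warm_mask, sigma=1.0, rng=rng)
```

Passing an all-`False` mask reproduces the $\mathcal{W} = \emptyset$ fallback, in which case the draw is the standard $\mathcal{N}(0, I)$ source.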

#### WarmPrior-Past.

The simplest plausible target is the previous action chunk, so WP-Past predicts a single chunk of $H$ actions and anchors $\mu$ on the chunk that preceded it.

At training, for each sample with in-buffer index $i$, we retrieve the $H$ preceding actions $a^{\mathrm{data}}_{i-H:i}$ from the replay buffer (normalized to the training action space), verify via a binary search on episode boundaries that the window lies within a single episode, and set:

$$\mu^{\mathrm{Past}}_{\tau} \;=\; a^{\mathrm{data}}[i-H+\tau], \qquad \text{for } \tau \in \{0, \dots, H-1\}. \tag{2}$$

When the window would cross an episode boundary (e.g., at the start of a demonstration), the sample falls back to $\mathcal{W} = \emptyset$.

At inference, we directly use the previously executed action chunk, setting $\mu^{\mathrm{Past}}_{\tau} = \hat{a}^{\mathrm{prev}}_{\tau}$ with $\mathcal{W} = \{0, \dots, H-1\}$, and fall back to $\mathcal{W} = \emptyset$ at the first chunk. We use $\sigma = 0.5$ for this variant.
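The training-time lookup for WP-Past, including the episode-boundary check, can be sketched as follows. The buffer layout is an assumption: `actions` is taken to be a flat array of all demonstration actions and `episode_starts` a sorted list holding the first index of each episode; `past_window` is a hypothetical helper, not a function from the authors’ code.

```python
import bisect
import numpy as np

def past_window(actions, episode_starts, i, H):
    """Return the H actions preceding index i, or None if the window
    [i-H, i) is not contained in the episode that sample i belongs to."""
    # Binary search for the episode containing i.
    ep = bisect.bisect_right(episode_starts, i) - 1
    if i - H < episode_starts[ep]:
        return None                     # fall back to the cold prior (W empty)
    return actions[i - H:i]             # Eq. (2): mu_tau = a_data[i - H + tau]

actions = np.arange(20.0).reshape(20, 1)    # toy buffer: two episodes
episode_starts = [0, 12]
mu = past_window(actions, episode_starts, i=10, H=8)
```

Returning `None` keeps the fallback explicit, so the caller can draw from $\mathcal{N}(0, I)$ for samples near the start of a demonstration.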

#### WarmPrior-Preview.

WP-Preview trains the policy to look one chunk further than it needs to: instead of predicting a single chunk of $H$ actions, it predicts $2H$ actions at each inference step and executes only the first $H$. The second $H$ steps serve as a *preview* of the next chunk, acting as the model’s own forecast of future actions. When the next decision step arrives, this preview aligns exactly with the first $H$ positions of the new prediction, providing a natural and highly accurate prior mean for the next generation process. Crucially, across both training and inference, the $2H$-step generation is strictly partitioned: the first $H$ steps (the actions to be executed) are generated starting from the WarmPrior ($\mathcal{W} = \{0, \dots, H-1\}$), while the second $H$ steps (the preview) are generated starting from pure Gaussian noise ($\mathcal{C} = \{H, \dots, 2H-1\}$).

At training, we face a chicken-and-egg problem: the ideal prior mean would be the model’s own past preview, which is unavailable before the model is trained. However, the ground-truth target itself is precisely the limit that a perfectly calibrated preview would converge to: at convergence, the model’s prior forecast of the current chunk should coincide with the chunk itself. Thus, we can simply use the ground-truth target itself as a proxy for a perfectly calibrated preview:

$$\mu^{\mathrm{Preview}}_{\tau} \;=\; a_1[\tau], \qquad \text{for } \tau \in \{0, \dots, H-1\}, \tag{3}$$

where $a_1 \in \mathbb{R}^{2H \times d_a}$ spans the full $2H$-step horizon. We use $\sigma = 1.0$ for this variant.

At inference, let $\hat{a}^{\mathrm{prev}} \in \mathbb{R}^{2H \times d_a}$ be the previous $2H$-step prediction. WP-Preview sets

$$\mu^{\mathrm{Preview}}_{\tau} \;=\; \hat{a}^{\mathrm{prev}}[H+\tau], \qquad \text{for } \tau \in \{0, \dots, H-1\}, \tag{4}$$

so that the warm first half of the new prior carries the previous forecast of the current chunk, while the cold second half covers the new horizon that no previous preview has seen. At the first chunk of an episode, where no previous prediction exists, we fall back to $\mathcal{W} = \emptyset$.
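To make the training-time proxy of Eq. (3) concrete, the sketch below builds one WP-Preview training pair under the linear interpolant $a_t = (1-t)\,a_0 + t\,a_1$. The helper name `preview_training_pair` and the returned tuple are illustrative assumptions; a real training loop would feed `a_t` and `t` to the velocity network and regress onto `v_target`.

```python
import numpy as np

def preview_training_pair(a1, H, sigma, rng):
    """Build (a0, t, a_t, v_target, eps) for one WP-Preview training sample.

    a1 : (2H, d_a) ground-truth actions spanning the full 2H-step horizon.
    """
    eps = rng.standard_normal(a1.shape)
    a0 = eps.copy()                          # cold half: pure N(0, I)
    a0[:H] = a1[:H] + sigma * eps[:H]        # warm half: Eq. (3) mean plus noise
    t = rng.uniform()
    a_t = (1.0 - t) * a0 + t * a1            # linear interpolant
    v_target = a1 - a0                       # regression target for v_theta
    return a0, t, a_t, v_target, eps

rng = np.random.default_rng(0)
H, d_a = 8, 7                                # d_a = 7 is an illustrative choice
a1 = rng.standard_normal((2 * H, d_a))       # stand-in for a demonstration chunk
a0, t, a_t, v_target, eps = preview_training_pair(a1, H, sigma=1.0, rng=rng)
```

On the warm half the target reduces to $-\sigma\,\varepsilon$, so during training the network only has to remove the injected residual noise there, while the cold half keeps the usual noise-to-data target.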

Table 1: Simulation success rate (%) on Robomimic and MimicGen (image) at $H=8$ across three inference budgets. Parentheses show the absolute gain over the $\mathcal{N}(0,I)$ baseline; green marks gains exceeding $\sigma_{\mathrm{base}} + \sigma_{\mathrm{method}}$ (non-overlapping $1\sigma$ seed intervals). Best per (task, NFE) in bold.

| Task | NFE=9 Base | NFE=9 WP-Past | NFE=9 WP-Preview | NFE=3 Base | NFE=3 WP-Past | NFE=3 WP-Preview | NFE=1 Base | NFE=1 WP-Past | NFE=1 WP-Preview |
|---|---|---|---|---|---|---|---|---|---|
| **Robomimic (state observation)** | | | | | | | | | |
| Square-PH | 86.7 | **88.1** (+1.4) | **88.1** (+1.4) | 86.2 | **88.0** (+1.8) | 87.9 (+1.7) | 83.6 | 86.6 (+3.0) | **87.3** (+3.7) |
| Square-MH | 65.9 | 69.2 (+3.3) | **72.7** (+6.8) | 65.4 | **73.2** (+7.8) | 72.9 (+7.5) | 65.9 | 70.1 (+4.2) | **77.8** (+11.9) |
| Transport-PH | 34.1 | 36.2 (+2.1) | **43.3** (+9.2) | 39.0 | 44.0 (+5.0) | **49.1** (+10.1) | 36.8 | 39.8 (+3.0) | **47.6** (+10.8) |
| Transport-MH | 16.3 | 20.7 (+4.4) | **24.3** (+8.0) | 21.3 | **30.7** (+9.4) | 30.4 (+9.1) | 23.3 | 30.2 (+6.9) | **34.5** (+11.2) |
| Tool-Hang-PH | 79.4 | 80.6 (+1.2) | **82.8** (+3.4) | 72.3 | 75.1 (+2.8) | **75.8** (+3.5) | 77.7 | 78.2 (+0.5) | **81.9** (+4.2) |
| **Robomimic (image observation)** | | | | | | | | | |
| Square-PH | 86.9 | 88.2 (+1.3) | **88.7** (+1.7) | 87.7 | 89.2 (+1.4) | **89.6** (+1.9) | 88.7 | **89.3** (+0.6) | 89.1 (+0.4) |
| Square-MH | 76.1 | **78.0** (+1.9) | 77.8 (+1.7) | 73.8 | **77.9** (+4.1) | 77.1 (+3.2) | 72.4 | **77.6** (+5.2) | 75.1 (+2.7) |
| Transport-PH | 92.8 | **94.5** (+1.7) | 94.3 (+1.6) | 92.1 | 93.9 (+1.9) | **94.9** (+2.9) | 91.3 | 93.4 (+2.2) | **93.7** (+2.4) |
| Transport-MH | 74.8 | 79.7 (+4.9) | **79.8** (+4.9) | 73.8 | 80.0 (+6.2) | **80.7** (+6.9) | 74.3 | 78.6 (+4.3) | **79.7** (+5.4) |
| Tool-Hang-PH | 43.7 | 45.8 (+2.1) | **56.3** (+12.6) | 36.9 | 38.4 (+1.4) | **50.7** (+13.8) | 41.3 | 38.9 (-2.4) | **54.0** (+12.7) |
| **MimicGen (image observation)** | | | | | | | | | |
| Stack | 21.4 | 22.8 (+1.4) | **31.6** (+10.2) | 21.3 | 23.7 (+2.4) | **30.7** (+9.4) | 21.3 | 22.4 (+1.1) | **28.7** (+7.4) |
| Coffee | 26.8 | 29.6 (+2.8) | **34.7** (+7.9) | 23.3 | 24.1 (+0.8) | **33.4** (+10.1) | 16.2 | 20.4 (+4.2) | **29.4** (+13.2) |
| Threading | 13.8 | 15.5 (+1.7) | **20.9** (+7.1) | 16.3 | 16.6 (+0.3) | **22.0** (+5.7) | 12.5 | 15.6 (+3.1) | **18.0** (+5.5) |

*Optimal transport.* Among the WarmPrior variants we consider, Preview is the choice that pushes the prior mean as close as possible to the target: when the preview is accurate, the flow starts directly on the model’s own forecast of the current chunk and only has to correct its residual error ([Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1)).

*Residual policy interpretation.* Because the warm portion of the prior already *is* a prediction of the current chunk, the flow only needs to learn the correction $a_1 - \mu^{\mathrm{Preview}}$ on top of a committed forecast. In this sense, WP-Preview turns the generative policy into a residual policy that refines its own previous plan.

## 4 Main Results

### 4.1 Simulation

#### Setup.

We evaluate in simulation on two robotic manipulation benchmarks: Robomimic Mandlekar et al. ([2021](https://arxiv.org/html/2605.13959#bib.bib11)) and MimicGen Mandlekar et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib13)). On Robomimic we evaluate under both state- and image-observation regimes on Square, Transport, and Tool-Hang in the PH (proficient-human) split, plus the harder MH (multi-human) splits for Square and Transport, omitting Lift and Can, on which the flow-matching policy already saturates near 100% success rate. On MimicGen we use the human-demonstration datasets (10 demos per task) for Stack, Coffee, and Threading under image observations.

We evaluate WarmPrior on the Diffusion Policy (ChiTransformer) Chi et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib8)), a widely adopted policy architecture, trained here with flow matching. All methods share the linear flow interpolant; only the source distribution changes. Behavior-cloning training curves for Diffusion Policy on Robomimic are known to be noticeably noisy across checkpoints Mandlekar et al. ([2021](https://arxiv.org/html/2605.13959#bib.bib11)). We therefore train each model for 200k iterations at a batch size of 1024 (state) or 256 (image), evaluate at regularly spaced checkpoints, and average the performance of the top-3 checkpoints per seed to mitigate this variance. The success rate is computed over 200 episodes and 3 seeds at three inference budgets (NFE $\in \{9, 3, 1\}$). Unless stated otherwise, the action-chunk length is $H=8$ for both Robomimic and MimicGen. Full training hyperparameters and additional implementation details are provided in [Appendix C](https://arxiv.org/html/2605.13959#A3).
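The aggregation protocol described above, together with the "green" criterion from the caption of Table 1, can be sketched as below. All numbers in the usage lines are made up for illustration, the function names are hypothetical, and the use of the population standard deviation (`ddof=0`) across seeds is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def top3_mean(checkpoint_success):
    """Average of the three best checkpoint success rates for one seed."""
    return float(np.mean(np.sort(checkpoint_success)[-3:]))

def summarize(per_seed_checkpoints):
    """Mean and standard deviation across seeds of the per-seed top-3 average."""
    scores = np.array([top3_mean(c) for c in per_seed_checkpoints])
    return float(scores.mean()), float(scores.std())   # ddof=0 is an assumption

def clear_gain(base, method):
    """Table 1 criterion: the gain exceeds sigma_base + sigma_method."""
    (mean_b, std_b), (mean_m, std_m) = base, method
    return (mean_m - mean_b) > (std_b + std_m)

# Illustrative (made-up) success rates: two seeds, four checkpoints each.
mean, std = summarize([[60, 70, 80, 90], [50, 82, 84, 86]])
```

A gain passes `clear_gain` only when the one-standard-deviation intervals of the baseline and the method do not overlap, which is a descriptive check rather than a formal significance test.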

#### Performance improvements.

[Table 1](https://arxiv.org/html/2605.13959#S3.T1) reports the full NFE sweep. The majority of evaluations exhibit non-overlapping one-standard-deviation intervals between the baseline and our method (green deltas), demonstrating that this simple modification to the prior distribution yields a highly significant performance boost. Furthermore, bold values highlight the best performance among the evaluated methods. While WP-Past achieves respectable performance gains, WP-Preview demonstrates even greater improvements. Finally, we observe that the magnitude of these improvements is most pronounced at the lowest inference budget, with the largest average performance gains occurring at NFE $=1$. We discuss the underlying reasons for both observations in [Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1).

### 4.2 Real-Robot Experiments

To validate our approach in the real world, we deploy our method on a Franka Research 3. As illustrated in [Figure 3](https://arxiv.org/html/2605.13959#S4.F3), we construct four tabletop manipulation tasks and collect human teleoperation demonstrations using the DROID platform setup Khazatsky et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib39)). Each task is trained on a dataset of 30 demonstrations.

For the policy architecture, we employ the GR00T N1.5 VLA model Bjorck et al. ([2025a](https://arxiv.org/html/2605.13959#bib.bib15)), which also utilizes a flow-matching action head. The models are trained for 30k steps with a batch size of 64; further training details are provided in [Appendix C](https://arxiv.org/html/2605.13959#A3). During inference, the number of function evaluations (NFE) is fixed to 4. We evaluate the performance using 3 independent training seeds, conducting 50 evaluation trials per seed. As reported in [Figure 3](https://arxiv.org/html/2605.13959#S4.F3), WarmPrior consistently improves the overall success rate across all four real-world tasks, with the largest gains on the precision-demanding *Cable Insertion* and *Block Stacking*.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x2.png)

Figure 2: Real-robot tasks. Four tabletop manipulation scenes used in [Figure 3](https://arxiv.org/html/2605.13959#S4.F3): *Food Waste Disposal*, *Cup Stacking*, *Block Stacking*, and *Cable Insertion*.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x3.png)

Figure 3: Real-robot success rate. We evaluate WP-Past, WP-Preview, and the $\mathcal{N}(0,I)$ baseline on four tabletop manipulation tasks, reporting the mean and standard deviation over three training seeds (50 trials per seed).

## 5 Understanding and Extending WarmPrior

In this section, we investigate *why* replacing the standard $\mathcal{N}(0,I)$ source with WarmPrior translates into the consistent gains of [Section 4](https://arxiv.org/html/2605.13959#S4), and what further consequences follow. [Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1) gives a geometric account: WarmPrior shortens the transport and straightens the learned probability paths. [Section 5.2](https://arxiv.org/html/2605.13959#S5.SS2) reveals a second, independent benefit, *temporal consistency*: WarmPrior supplies a $\sigma$-tunable form of the consistency that action chunking provides explicitly, and the effect is most pronounced when explicit chunking is turned off ($H=1$). [Section 5.3](https://arxiv.org/html/2605.13959#S5.SS3) then extends the same prior to reinforcement learning, showing that it also reshapes the exploration space of prior-space RL. The first two subsections explain the behavior-cloning gains of [Table 1](https://arxiv.org/html/2605.13959#S3.T1); the third is a natural extension of WarmPrior to a downstream setting.

### 5.1 WarmPrior Improves SR by Straightening Flow Trajectories

![Refer to caption](https://arxiv.org/html/2605.13959v1/x4.png)

Figure 4: Flow trajectories on Square-MH. Normalized action coordinate vs. denoising time $t \in [0,1]$; bottom markers: prior $p_0$, top markers: prediction $p_1$.

#### Empirical observation.

[Figure 4](https://arxiv.org/html/2605.13959#S5.F4) shows the integration paths of $a_t$ for a baseline flow policy and for WarmPrior on the same observations from Robomimic Square-MH. The baseline paths curve noticeably as the network pulls samples from a random origin onto the action manifold; the WarmPrior paths, already starting close to the manifold, are visibly straighter and more parallel. Intuitively, because fewer flows cross one another, the conditional flow-matching network spends less capacity realigning samples from the random base distribution and can devote more to refining actions, exactly where it matters for downstream success rate.

Table 2: Pathwise curvature $\kappa(o)$ of the learned flow on state-observation tasks ([Equation 5](https://arxiv.org/html/2605.13959#S5.E5); lower is straighter; values normalized so the $\mathcal{N}(0,I)$ baseline reads $1.000$).

| Task | $\mathcal{N}(0,I)$ | WP-Past | WP-Preview |
|---|---|---|---|
| Square-PH | 1.000 | 0.823 | 0.803 |
| Square-MH | 1.000 | 0.705 | 0.559 |
| Transport-PH | 1.000 | 0.720 | 0.692 |
| Transport-MH | 1.000 | 0.695 | 0.637 |
| Tool-Hang-PH | 1.000 | 0.806 | 0.807 |

#### Curvature diagnostic.

To make the observation quantitative, we measure the pathwise curvature of the learned flow. For a smooth path $a : [0,1] \to \mathbb{R}^{H \times d_a}$ with $\dot{a}_t = v_\theta(t, a_t, o)$ we use the standard velocity-variance surrogate

$$\kappa(o) \;=\; \int_0^1 \|\dot{a}_t - \bar{v}\|_2^2 \, dt, \qquad \bar{v} = \int_0^1 \dot{a}_t \, dt, \tag{5}$$

evaluated by finite differences along the Euler sampler with $N=100$ steps. We compute $\kappa(o)$ over $2{,}000$ validation observations and report the average in [Table 2](https://arxiv.org/html/2605.13959#S5.T2). Every task exhibits a reduction in mean curvature, and the relative reduction tracks the success-rate gain of [Table 1](https://arxiv.org/html/2605.13959#S3.T1): tasks with the largest curvature reduction (Square-MH, Transport-MH) are also the tasks with the largest SR gain, supporting the straightening-explains-performance hypothesis.
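A discrete version of the surrogate in Eq. (5) is short to write down. The sketch below assumes that the finite-difference velocity at Euler step $k$ is simply the velocity used for that step; `path_curvature` and the two toy fields are hypothetical and stand in for a trained $v_\theta$.

```python
import numpy as np

def path_curvature(velocity, a0, obs, n_steps=100):
    """Velocity-variance surrogate of Eq. (5) along an Euler rollout.

    Returns the time-average of ||v_k - v_bar||^2, where v_k is the velocity
    used at Euler step k and v_bar is the mean velocity over the path.
    """
    a, dt = a0.copy(), 1.0 / n_steps
    vels = []
    for k in range(n_steps):
        v = velocity(k * dt, a, obs)
        vels.append(v)
        a = a + dt * v
    vels = np.stack(vels)
    v_bar = vels.mean(axis=0)                  # equals a_1 - a_0 for this rollout
    sq_dev = np.sum((vels - v_bar) ** 2, axis=tuple(range(1, vels.ndim)))
    return float(np.mean(sq_dev))

a0 = np.zeros((1, 2))
straight = lambda t, a, obs: np.ones_like(a)   # constant velocity: a straight path
curved = lambda t, a, obs: np.array([[np.cos(np.pi * t), np.sin(np.pi * t)]])
k_straight = path_curvature(straight, a0, None)
k_curved = path_curvature(curved, a0, None)
```

A constant-velocity field gives zero curvature, while the rotating field gives a strictly positive value. Dividing each method’s average by the baseline’s average yields normalized entries of the kind reported in Table 2.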

#### Branching cost: an irreducible residual.

The curvature reduction has a measure-theoretic origin we call the *branching cost*. Vectorize an action chunk into $\mathbb{R}^d$, let $(A_0, A_1) \sim \Pi_o$ denote the conditional joint law of source and target, and write $A_t = (1-t)A_0 + tA_1$. The flow-matching objective $\mathcal{L}_o(v) = \int_0^1 \mathbb{E}_{\Pi_o}[\|v_t(A_t, o) - (A_1 - A_0)\|^2 \mid o]\,dt$ regresses the transport direction $A_1 - A_0$, and because only $(A_t, o)$ is observable, the best attainable predictor is the conditional expectation $v_t^\star(x, o) = \mathbb{E}[A_1 - A_0 \mid A_t = x, o]$. The residual error this predictor cannot eliminate,

$$\mathcal{B}(o) \;\coloneqq\; \mathcal{L}_o(v^\star) \;=\; \int_0^1 \frac{\mathbb{E}\bigl[\|A_1 - \mathbb{E}[A_1 \mid A_t, o]\|^2 \,\bigm|\, o\bigr]}{(1-t)^2}\,dt, \tag{6}$$

measures how ambiguous $A_1$ remains after observing $A_t$: when many distinct targets share an $A_t$, $v^\star$ must average over them and the trajectory bends. A standard total-variance decomposition splits the coupling cost $\mathbb{E}[\|A_1 - A_0\|^2 \mid o]$ into the kinetic action of $v^\star$ plus $\mathcal{B}(o)$ (see [Appendix B](https://arxiv.org/html/2605.13959#A2) for the full derivation); the second term is pure excess caused by directional ambiguity and vanishes for OT couplings, where $\mathcal{B} \equiv 0$ McCann ([1997](https://arxiv.org/html/2605.13959#bib.bib1)); Benamou and Brenier ([2000](https://arxiv.org/html/2605.13959#bib.bib2)).
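The $(1-t)^{-2}$ weight in Eq. (6) follows from a short substitution under the linear interpolant. The steps below are a sketch of that reasoning using only the definitions given above; the complete argument is the one referred to in Appendix B.

```latex
\begin{aligned}
A_t = (1-t)A_0 + tA_1
  &\;\Longrightarrow\; A_1 - A_0 = \frac{A_1 - A_t}{1-t}, \\
v_t^\star(A_t, o) = \mathbb{E}[A_1 - A_0 \mid A_t, o]
  &= \frac{\mathbb{E}[A_1 \mid A_t, o] - A_t}{1-t}, \\
(A_1 - A_0) - v_t^\star(A_t, o)
  &= \frac{A_1 - \mathbb{E}[A_1 \mid A_t, o]}{1-t}.
\end{aligned}
```

Taking the squared norm of the last line, the conditional expectation given $o$, and the integral over $t \in [0,1]$ recovers the right-hand side of Eq. (6).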

#### How WarmPrior reduces the branching cost\.

WarmPrior writes the source asA0=P𝒲​\(μ\+σ​Ξ\)\+P𝒞​ΞA\_\{0\}=P\_\{\\mathcal\{W\}\}\(\\mu\+\\sigma\\Xi\)\+P\_\{\\mathcal\{C\}\}\\XiwithΞ∼𝒩​\(0,I\)\\Xi\\sim\\mathcal\{N\}\(0,I\)independent of\(A1,μ\)\(A\_\{1\},\\mu\)givenoo, whereP𝒲P\_\{\\mathcal\{W\}\}projects onto the warm coordinates \(d𝒲d\_\{\\mathcal\{W\}\}dimensions\) andP𝒞=I−P𝒲P\_\{\\mathcal\{C\}\}=I\-P\_\{\\mathcal\{W\}\}\. Bounding the optimal predictor’s error by that of the simpler predictorP𝒲​AtP\_\{\\mathcal\{W\}\}A\_\{t\}cancels the\(1−t\)2\(1\-t\)^\{2\}factor in \([6](https://arxiv.org/html/2605.13959#S5.E6)\) \([Appendix˜B](https://arxiv.org/html/2605.13959#A2), Proposition[B\.2](https://arxiv.org/html/2605.13959#A2.Thmtheorem2)\), giving

$$\mathcal{B}_{\mathcal{W}}(o)\;\leq\;\underbrace{\mathbb{E}\!\left[\|P_{\mathcal{W}}(A_{1}-\mu)\|^{2}\,\middle|\,o\right]}_{\text{mean mismatch}}+\;\sigma^{2}d_{\mathcal{W}}. \tag{7}$$

The warm-coordinate branching cost is therefore controlled by two intuitive quantities: how well the prior mean $\mu$ predicts the target, and how much residual noise $\sigma$ is injected. This immediately explains the ordering of our variants. Preview sets $\mu$ to a forecast of the current chunk, so the mismatch reduces to the forecast error $\mathbb{E}[\|E\|^{2}\mid o]$ with $P_{\mathcal{W}}A_{1}=P_{\mathcal{W}}\mu+E$; in the idealized limit of an exact forecast ($E=0$) only the irreducible $\sigma^{2}d_{\mathcal{W}}$ term survives. Past reuses the previously executed chunk, replacing $E$ with the persistence residual $R$ between consecutive chunks and yielding the same form $(\text{prediction error})+\sigma^{2}d_{\mathcal{W}}$. Whenever the forecaster improves on persistence ($\mathbb{E}[\|E\|^{2}\mid o]\leq\mathbb{E}[\|R\|^{2}\mid o]$), Preview attains a tighter bound, matching the ordering observed in [Table 1](https://arxiv.org/html/2605.13959#S3.T1). In both cases WarmPrior acts as an amortized approximation to the OT coupling, shortening transport and suppressing the directional ambiguity that bends the learned field.
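To see how tight Equation (7) can be, the sketch below compares the bound with the exact branching cost in a one-dimensional Gaussian toy. The toy is an assumption made for illustration, not the paper's setting: every coordinate is warm ($d_{\mathcal{W}}=1$), $\mu$ is deterministic given $o$, the source is $A_{0}=\mu+\sigma\Xi$, and the target is $A_{1}=\mu+E$ with $E\sim\mathcal{N}(0,s^{2})$ independent of $\Xi$. The names `exact_branching_cost` and `warmprior_bound` are hypothetical.

```python
import numpy as np


def exact_branching_cost(sigma: float, s: float, n_grid: int = 200_000) -> float:
    """Eq. (6) for the assumed toy A0 = mu + sigma * Xi, A1 = mu + E, E ~ N(0, s^2)."""
    t = (np.arange(n_grid) + 0.5) / n_grid  # midpoint rule on [0, 1]
    # Var(A_t | o) = (1 - t)^2 sigma^2 + t^2 s^2 because Xi and E are independent.
    var_at = (1 - t) ** 2 * sigma ** 2 + t ** 2 * s ** 2
    # Gaussian conditioning: Var(A1 | A_t) = s^2 (1 - t)^2 sigma^2 / Var(A_t).
    # Dividing by (1 - t)^2 gives the integrand of Eq. (6).
    integrand = (s ** 2) * (sigma ** 2) / var_at
    return float(integrand.mean())


def warmprior_bound(sigma: float, s: float, d_warm: int = 1) -> float:
    """Right-hand side of Eq. (7): mean mismatch E[(A1 - mu)^2] plus sigma^2 * d_W."""
    return s ** 2 + sigma ** 2 * d_warm


if __name__ == "__main__":
    for sigma, s in [(0.1, 0.1), (0.1, 0.5), (0.5, 0.5), (1.0, 0.2)]:
        exact = exact_branching_cost(sigma, s)
        bound = warmprior_bound(sigma, s)
        print(f"sigma={sigma} s={s} exact={exact:.4f} bound={bound:.4f}")
```

Under these toy assumptions the exact cost integrates to $(\pi/2)\,\sigma s$, which never exceeds $s^{2}+\sigma^{2}$, so both a better prior mean (smaller $s$) and a smaller $\sigma$ reduce the cost, as the bound suggests.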

The bound also exposes a trade-off along $\sigma$, an axis separate from aligning $\mu$: smaller $\sigma$ tightens $\sigma^{2}d_{\mathcal{W}}$ in [Equation 7](https://arxiv.org/html/2605.13959#S5.E7) and favors a straighter field, but concentrates the source onto $\mu$, which only helps if $\mu$ is reliable. In practice it is not (WP-Past carries the persistence residual, WP-Preview the forecast error), so an overly small $\sigma$ leaves no slack to absorb this variability and degrades success rate. The right $\sigma$ balances straightness against robustness to prior-mean diversity; we defer the full ablation to [Appendix D](https://arxiv.org/html/2605.13959#A4).

![Refer to caption](https://arxiv.org/html/2605.13959v1/x5.png) Figure 5: Mode switching in a 1D navigation toy. All policies share a 1024-d 4-layer MLP backbone trained for 50k iterations with batch size 256. Six demonstrations pass through two obstacles (three above, three below), inducing a bimodal $p(a\mid o)$ at each position. (a) training data; (b) regression collapses to the mean; (c) naive flow matching recovers both modes but oscillates between them; (d) history-conditioned flow matching commits within a rollout but drifts off-manifold under inference-time history shift; (e, f) WarmPrior at $\sigma=0.1$ and $\sigma=0.5$ commits per rollout, with $\sigma$ tuning between temporal consistency and multimodality.

### 5\.2 WarmPrior as a Tunable Source of Temporal Consistency

Beyond the geometric benefit of [Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1), WarmPrior provides a second, independent advantage: a *tunable* form of *implicit temporal consistency* between consecutive inferences.

#### Mode switching in generative control policies\.

Consider the 1D navigation toy in [Figure 5](https://arxiv.org/html/2605.13959#S5.F5)(a), where the observation $o$ is the agent’s horizontal position and the action $a$ is its vertical height. Demonstrations split evenly between passing above and passing below each obstacle, so the conditional distribution $p(a\mid o)$ is multimodal at every $o$. A regression policy averages the branches and collapses to the mean, driving straight through the obstacle ([Figure 5](https://arxiv.org/html/2605.13959#S5.F5), panel b). A flow-matching policy trained on the standard objective recovers both modes, but only at the level of *per-inference marginals*: the objective places no constraint linking the chunk produced at one inference to the chunk produced at the next. The policy is therefore free to pick a different mode at each inference, yielding an execution that oscillates rapidly between them, a pathology we term *mode switching* ([Figure 5](https://arxiv.org/html/2605.13959#S5.F5), panel c). Action chunking Zhao et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib10)) enforces commitment *within* a chunk, but the objective still treats consecutive chunks independently and the oscillation persists at every chunk boundary. A natural remedy is to condition the policy on the *action history* $h$, but naive history conditioning is costly and fragile: it substantially slows convergence and inflates per-step compute and memory Koo et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib25)), and has two further drawbacks. First, conditioning on $h$ pins the policy to whichever mode its history already commits to, reducing the effective multimodality of $p(a\mid o)$: in [Figure 5](https://arxiv.org/html/2605.13959#S5.F5)(d), every rollout follows either an above-above or a below-below path, with no recombination across branches. Second, at inference time small execution errors compound into a distributional shift over $h$.

#### Tunable temporal consistency via $\sigma$\.

Because the prior mean is correlated with the previous action chunk, a small $\sigma$ keeps the new chunk inside the nearby mode’s basin and prevents the flow from crossing between distant modes ([Figure 5](https://arxiv.org/html/2605.13959#S5.F5), panel e), while a large $\sigma$ broadens the source and recovers more of the multimodal distribution at every step ([Figure 5](https://arxiv.org/html/2605.13959#S5.F5), panel f). The prior variance therefore acts as a continuous *regulator* between *temporal consistency* and *multimodality*. Crucially, unlike history conditioning or long action chunks that *explicitly* enforce temporal consistency at training or inference time, WarmPrior only *implicitly* biases the source distribution: within each rollout it commits the policy to a single coherent mode, while leaving the generative policy’s room for multimodality intact across rollouts.
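The source construction from Section 5.1, $A_{0}=P_{\mathcal{W}}(\mu+\sigma\Xi)+P_{\mathcal{C}}\Xi$, can be sketched in a few lines to make the role of $\sigma$ concrete. This is a minimal illustration that assumes $P_{\mathcal{W}}$ is a coordinate mask over the flattened chunk; `sample_warm_source`, `warm_mask`, and the example values are hypothetical and not taken from the paper.

```python
import numpy as np


def sample_warm_source(mu, sigma, warm_mask, rng):
    """Draw A0 = P_W (mu + sigma * Xi) + P_C Xi with Xi ~ N(0, I).

    mu        : array of shape (d,), prior mean (e.g. built from the previous chunk)
    sigma     : float, prior standard deviation on the warm coordinates
    warm_mask : boolean array of shape (d,), True where a coordinate is warm
    rng       : numpy.random.Generator
    """
    xi = rng.standard_normal(mu.shape)
    # Warm coordinates are centered at mu with scale sigma;
    # cold coordinates fall back to the standard Gaussian source.
    return np.where(warm_mask, mu + sigma * xi, xi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([0.8, 0.8, -0.3, 0.0])
    warm_mask = np.array([True, True, True, False])
    for sigma in (0.1, 0.5):
        draws = np.stack(
            [sample_warm_source(mu, sigma, warm_mask, rng) for _ in range(5000)]
        )
        print(sigma, draws.mean(axis=0).round(2), draws.std(axis=0).round(2))
```

With a small `sigma` the warm coordinates stay close to `mu`, so consecutive inferences start near the same mode; a larger `sigma` spreads the source and lets the flow reach other modes.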

#### Isolating the consistency effect at $H=1$\.

To strip away explicit chunking and isolate the prior’s implicit consistency bias, we set $H=1$, so the policy runs a fresh inference every timestep and the WarmPrior becomes the sole source of inter-step consistency. [Figure 6](https://arxiv.org/html/2605.13959#S5.F6) reports SR at $H=1$ and NFE $=1$. The $\mathcal{N}(0,I)$ baseline degrades sharply (e.g. Transport-MH: $23.3\%\to 1.3\%$), while WarmPrior recovers most of the lost performance, with gains of up to $+14.8$ on MimicGen Coffee. These gains are consistently larger than at the default $H=8$, confirming that the prior carries more weight once explicit chunking is stripped away.

#### Practical implication\.

Action chunking locks the policy to an $H$-step plan and cannot react to new observations within a chunk, which is a liability on tasks requiring fast reactivity. WarmPrior offers an alternative that *preserves temporal consistency while allowing per-step re-planning*, and we see this direction as a promising avenue for future work.

### 5\.3 WarmPrior Improves Prior\-Space RL Efficiency

Beyond behavior cloning, the same source\-distribution shaping extends to the RL stage: using the WarmPrior\-pretrained policy as the frozen base, the prior mean additionally reshapes the exploration space of prior\-space reinforcement learning and yields a substantial efficiency gain\.

#### Background: DSRL\.

DSRL Wagenmaker et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib12)) fine-tunes a pretrained diffusion policy with reinforcement learning by acting in the prior space: instead of having the RL agent output actions directly, the agent proposes the prior sample $a_{0}$, which is then mapped deterministically to an action $a_{1}$ via the frozen pretrained policy’s ODE sampler (DDIM Song et al. ([2021](https://arxiv.org/html/2605.13959#bib.bib27)) or flow-matching). The RL action space is therefore $\mathbb{R}^{H\times d_{a}}$, matching the shape of the prior sample. By acting on the prior sample, DSRL eliminates the need to backpropagate through the diffusion sampler and makes the policy compatible with off-the-shelf RL algorithms. Two algorithms are commonly used within this framework: DSRL-SAC, which applies SAC Haarnoja et al. ([2018](https://arxiv.org/html/2605.13959#bib.bib28)) directly to the noise-space MDP, and DSRL-NA, which exploits the diffusion policy’s noise-aliasing structure through a dual-critic scheme that distills an action-space critic $Q^{A}$ into a noise-space critic $Q^{W}$. However, the exploration space remains the uninformative $\mathcal{N}(0,I)$ prior, forcing the RL agent to search across the full prior from scratch.
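The deterministic map from a prior sample $a_{0}$ to an action $a_{1}$ can be pictured as a fixed-step ODE solve. The sketch below uses forward Euler with `nfe` steps on a toy straight-line velocity field; it is an illustration under stated assumptions, not DSRL's or the paper's sampler, and `euler_flow_sample` and `make_straight_velocity` are hypothetical names. The point is only that, once the velocity field is frozen, the agent's choice of $a_{0}$ fully determines $a_{1}$.

```python
import numpy as np


def euler_flow_sample(velocity, a0, obs, nfe=1):
    """Integrate dx/dt = velocity(x, t, obs) from t = 0 to t = 1 with `nfe` Euler steps."""
    x = np.asarray(a0, dtype=float).copy()
    dt = 1.0 / nfe
    for k in range(nfe):
        t = k * dt  # Euler evaluates the field at the left endpoint, never at t = 1
        x = x + dt * velocity(x, t, obs)
    return x


def make_straight_velocity(target):
    """Toy stand-in for a frozen pretrained field (assumption, not a trained model).

    It transports every point toward `target` along a straight line:
    v(x, t) = (target - x) / (1 - t). The observation is ignored here.
    """
    target = np.asarray(target, dtype=float)
    return lambda x, t, obs: (target - x) / (1.0 - t)


if __name__ == "__main__":
    velocity = make_straight_velocity([1.0, -2.0])
    for nfe in (1, 2, 8):
        print(nfe, euler_flow_sample(velocity, [0.3, 0.7], obs=None, nfe=nfe))
```

Because this toy field is perfectly straight, a single Euler step already lands on the target; a curved field would need more function evaluations to reach the same accuracy.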

#### Method: Conditioned\-residual WarmPrior\.

WarmPrior offers an immediate structural improvement: because the WarmPrior mean is already close to the target action manifold, the RL agent only has to learn a bounded *residual* around it. Concretely, we extend the observation to $\tilde{o}=[o,\mu]$ and bound the RL action to a small magnitude $\delta$:

$$a_{0}=\mu+\Delta,\quad\Delta=\pi_{\mathrm{RL}}(\tilde{o})\in[-\delta,\delta]^{H\times d_{a}}. \tag{8}$$

In practice we set $\delta=0.5$, compared to $\delta=1.5$ used by vanilla DSRL. The RL agent now explores a region that is $3\times$ tighter per coordinate and centered on a temporally grounded WarmPrior mean rather than the origin, so the agent no longer searches the full prior from scratch and instead refines a local correction around an anchor that is already a competent action. The RL policy also receives the prior mean $\mu$ as part of its augmented observation $\tilde{o}$. Since $\mu$ already encodes past chunks, appending it to the observation absorbs that dependency into the state so the RL problem stays Markovian, and it lets the residual $\Delta$ adapt to the current anchor.
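Equation (8) can be sketched as follows. The paper only states that $\Delta$ lies in $[-\delta,\delta]^{H\times d_{a}}$; the `tanh` squashing used here to enforce that bound is an assumption (a common choice in SAC-style actors), and `augment_observation`, `residual_prior_sample`, and `raw_policy_output` are hypothetical names.

```python
import numpy as np


def augment_observation(obs, mu):
    """Build o_tilde = [o, mu] by appending the flattened prior mean to the observation."""
    return np.concatenate([np.ravel(obs), np.ravel(mu)])


def residual_prior_sample(mu, raw_policy_output, delta=0.5):
    """Return a0 = mu + Delta with Delta in [-delta, delta] elementwise (Eq. 8).

    `raw_policy_output` is an unbounded array with the same shape as `mu`
    (H x d_a); tanh squashing is an assumed way to impose the bound.
    """
    residual = delta * np.tanh(raw_policy_output)
    return mu + residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, d_a = 8, 7                                  # illustrative chunk length and action dim
    obs = rng.standard_normal(19)                  # illustrative observation vector
    mu = rng.uniform(-1.0, 1.0, size=(H, d_a))     # stand-in for the WarmPrior mean
    raw = 3.0 * rng.standard_normal((H, d_a))      # stand-in for the RL actor output
    o_tilde = augment_observation(obs, mu)
    a0 = residual_prior_sample(mu, raw, delta=0.5)
    print(o_tilde.shape, a0.shape, np.abs(a0 - mu).max())
```

The proposed $a_{0}$ would then be passed through the frozen sampler to obtain the executed action chunk.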

![Refer to caption](https://arxiv.org/html/2605.13959v1/x6.png) Figure 6: Action-chunk length $H=1$ results (NFE $=1$). Five Robomimic state tasks and three MimicGen image tasks.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x7.png) Figure 7: Prior-space RL. DSRL baselines vs. WarmPrior variants on Robomimic Square and Transport, averaged over 3 seeds ($\pm 1\sigma$ shading).

#### Setup\.

Among the Robomimic tasks, Lift and Can are already near-saturated under BC, so we run RL fine-tuning on Square and Transport with a frozen WarmPrior backbone pretrained by behavior cloning for 3000 epochs. Our WP-Past and WP-Preview instantiate DSRL-NA with the conditioned residual of ([8](https://arxiv.org/html/2605.13959#S5.E8)), and we compare against vanilla DSRL-SAC and DSRL-NA as baselines.

#### Findings\.

[Figure 7](https://arxiv.org/html/2605.13959#S5.F7) shows that WP-Past and WP-Preview learn faster, converge more stably, and reach a higher asymptote than both DSRL baselines: both consistently exceed $0.99$ on Square, and on Transport they attain $\sim 0.97$, while DSRL-NA and DSRL-SAC plateau around $0.9$. To our knowledge, this is the first result to stably surpass $95\%$ success on Transport, the hardest of the Robomimic tasks, by RL fine-tuning of a flow-matching policy. Because WarmPrior provides an efficient, semantically meaningful prior space centered on $\mu(o,h)$, searching over it is far more valuable than exploring an uninformed random noise space.

## 6 Conclusion

We revisited the source distribution of generative robotic policies and showed that replacing the uninformative $\mathcal{N}(0,I)$ with a temporally grounded WarmPrior consistently improves success rate on Robomimic, MimicGen, and a real Franka setup. This single design choice straightens the learned flow in an OT-aligned sense, exposes a continuous $\sigma$-knob between within-rollout consistency and multimodal expressiveness, and shrinks the search space of prior-space RL on top of the pretrained policy. Because WarmPrior leaves the network, interpolant, and loss untouched, we view the prior distribution as a new axis worth exploring in generative-policy design.

## References

- M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden (2025) Stochastic interpolants: a unifying framework for flows and diffusions. Journal of Machine Learning Research 26 (209), pp. 1–80. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- J. Benamou and Y. Brenier (2000) A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik 84 (3), pp. 375–393. Cited by: [§5.1](https://arxiv.org/html/2605.13959#S5.SS1.SSS0.Px3.p1.15).
- J. Bjorck, V. Blukis, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, X. Jiang, J. Kautz, K. Kundalia, Z. Li, K. Lin, Z. Lin, L. Magne, Y. Man, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, J. Wang, Q. Wang, S. Wang, J. Xiang, Y. Xie, Y. Xu, S. Ye, Z. Yu, Y. Zhao, Z. Zhang, R. Zheng, and Y. Zhu (2025a) GR00T N1.5: an open foundation model for generalist humanoid robots. NVIDIA Isaac GR00T technical report. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1), [§1](https://arxiv.org/html/2605.13959#S1.p4.1), [§4.2](https://arxiv.org/html/2605.13959#S4.SS2.p2.1).
- J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. (2025b) GR00T N1: an open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1), [§1](https://arxiv.org/html/2605.13959#S1.p1.1), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky (2025a) $\pi_{0}$: a vision-language-action flow model for general robot control. In Robotics: Science and Systems. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- K. Black, M. Y. Galliker, and S. Levine (2025b) Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339. Cited by: [Appendix E](https://arxiv.org/html/2605.13959#A5.p1.1).
- M. Braun, N. Jaquier, L. Rozo, and T. Asfour (2024) Riemannian flow matching policy for robot motion learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- G. Chen, Z. Li, S. Wang, J. Jiang, Y. Liu, L. Lu, D. Huang, W. Byeon, M. Le, T. Rintamaki, T. Poon, M. Ehrlich, T. Lu, L. Wang, B. Catanzaro, J. Kautz, A. Tao, Z. Yu, and G. Liu (2025) Eagle 2.5: boosting long-context post-training for frontier vision-language models. arXiv preprint arXiv:2504.15271. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- K. Chen, E. Lim, K. Lin, Y. Chen, and H. Soh (2024) Don’t start from scratch: behavioral refinement via interpolant-based policy diffusion. arXiv preprint arXiv:2402.16075. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px3.p1.2).
- C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song (2023) Diffusion policy: visuomotor policy learning via action diffusion. In Robotics: Science and Systems. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1), [§1](https://arxiv.org/html/2605.13959#S1.p1.1), [§1](https://arxiv.org/html/2605.13959#S1.p4.1), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7), [§4.1](https://arxiv.org/html/2605.13959#S4.SS1.SSS0.Px1.p2.2).
- E. Chisari, N. Heppert, M. Argus, T. Welschehold, T. Brox, and A. Valada (2024) Learning robotic manipulation policies from point clouds with conditional flow matching. In Conference on Robot Learning. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning. Cited by: [§5.3](https://arxiv.org/html/2605.13959#S5.SS3.SSS0.Px1.p1.6).
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1).
- X. Hu, B. Liu, X. Liu, and Q. Liu (2024) AdaFlow: imitation learning with variance-adaptive flow-based policies. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine (2022) Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- J. Jia, G. Li, X. Chen, T. An, Y. Hu, J. Li, X. Guo, and J. Yang (2026) Action-to-action flow matching. arXiv preprint arXiv:2602.07322. Cited by: [Appendix D](https://arxiv.org/html/2605.13959#A4.SS0.SSS0.Px3.p1.8), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px3.p1.2).
- A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. (2024) DROID: a large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems. Cited by: [Appendix E](https://arxiv.org/html/2605.13959#A5.SS0.SSS0.Px1.p1.1), [§4.2](https://arxiv.org/html/2605.13959#S4.SS2.p1.1).
- M. Koo, D. Choi, T. Kim, K. Lee, C. Kim, Y. Seo, and J. Shin (2025) HAMLET: switch your vision-language-action model into a history-aware policy. arXiv preprint arXiv:2510.00695. Cited by: [§5.2](https://arxiv.org/html/2605.13959#S5.SS2.SSS0.Px1.p1.8).
- J. Li, Y. Cong, Y. Wang, H. Xia, S. Huang, Y. Zhang, N. Xu, and G. Dai (2026) STEP: warm-started visuomotor policies with spatiotemporal consistency prediction. arXiv preprint arXiv:2602.08245. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px3.p1.2).
- Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023) Flow matching for generative modeling. In International Conference on Learning Representations. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- X. Liu, C. Gong, and Q. Liu (2023) Flow straight and fast: learning to generate and transfer data with rectified flow. In International Conference on Learning Representations. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px2.p1.1).
- I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- G. Lu, Z. Gao, T. Chen, W. Dai, Z. Wang, W. Ding, and Y. Tang (2024) ManiCM: real-time 3D diffusion policy via consistency model for robotic manipulation. arXiv preprint arXiv:2406.01586. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1).
- A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox (2023) MimicGen: a data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning. Cited by: [§4.1](https://arxiv.org/html/2605.13959#S4.SS1.SSS0.Px1.p1.1).
- A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín (2021) What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning. Cited by: [§4.1](https://arxiv.org/html/2605.13959#S4.SS1.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.13959#S4.SS1.SSS0.Px1.p2.2).
- R. J. McCann (1997) A convexity principle for interacting gases. Advances in Mathematics 128 (1), pp. 153–179. Cited by: [§5.1](https://arxiv.org/html/2605.13959#S5.SS1.SSS0.Px3.p1.15).
- W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke, A. Walling, H. Wang, L. Yu, and U. Zhilinsky (2025) $\pi_{0.5}$: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054. Cited by: [Appendix E](https://arxiv.org/html/2605.13959#A5.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px1.p1.7).
- A. Pooladian, H. Ben-Hamu, C. Domingo-Enrich, B. Amos, Y. Lipman, and R. T. Q. Chen (2023) Multisample flow matching: straightening flows with minibatch couplings. In International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px2.p1.1).
- A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg (2024) Consistency policy: accelerated visuomotor policies via consistency distillation. In Robotics: Science and Systems. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1).
- Qwen Team (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- Y. Shi, V. D. Bortoli, A. Campbell, and A. Doucet (2023) Diffusion Schrödinger bridge matching. In Advances in Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px2.p1.1).
- J. Song, C. Meng, and S. Ermon (2021) Denoising diffusion implicit models. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1), [§5.3](https://arxiv.org/html/2605.13959#S5.SS3.SSS0.Px1.p1.6).
- A. Tong, K. Fatras, N. Malkin, G. Huguet, Y. Zhang, J. Rector-Brooks, G. Wolf, and Y. Bengio (2024a) Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px2.p1.1).
- A. Tong, N. Malkin, K. Fatras, L. Atanackovic, Y. Zhang, G. Huguet, G. Wolf, and Y. Bengio (2024b) Simulation-free Schrödinger bridges via score and flow matching. In International Conference on Artificial Intelligence and Statistics. Cited by: [§2](https://arxiv.org/html/2605.13959#S2.SS0.SSS0.Px2.p1.1).
- TRI LBM Team (2025) A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331. Cited by: [Appendix A](https://arxiv.org/html/2605.13959#A1.p1.8).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine (2025) Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p3.1), [§5.3](https://arxiv.org/html/2605.13959#S5.SS3.SSS0.Px1.p1.6).
- Z. Wang, Z. Li, A. Mandlekar, Z. Xu, J. Fan, Y. Narang, L. Fan, Y. Zhu, Y. Balaji, M. Zhou, M. Liu, and Y. Zeng (2025) One-step diffusion policy: fast visuomotor policies via diffusion distillation. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.13959#S1.p1.1).
- Y. Wu and K. He (2018) Group normalization. In European Conference on Computer Vision. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer (2023) Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision. Cited by: [Appendix C](https://arxiv.org/html/2605.13959#A3.p1.1).
- T. Z. Zhao, V. Kumar, S. Levine, and C. Finn (2023) Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems. Cited by: [§5.2](https://arxiv.org/html/2605.13959#S5.SS2.SSS0.Px1.p1.8).

## Appendix A Visualizing Success\-Rate Uncertainty with Beta Posteriors

We adopt the evaluation philosophy of TRI LBM Team ([2025](https://arxiv.org/html/2605.13959#bib.bib43)), which argues that single-number means with Gaussian error bars are an impoverished summary of policy performance and instead pushes for full posterior visualizations of the success-rate parameter. A seed-standard-error bar implicitly assumes the per-seed SR is symmetric and well approximated by a Gaussian; for a Bernoulli event near $0$ or $1$ the likelihood is skewed and the bar would cross an impossible boundary, and it conveys nothing about how *overlapping* two methods’ distributions are. We therefore complement the bar charts in the main paper with Beta-posterior violins: for each (method, task) cell with $k$ successes out of $n=200\times 3=600$ rollouts (200 episodes per seed, 3 seeds), we visualize the posterior $\mathrm{Beta}(k+1,\,n-k+1)$ under a uniform $\mathrm{Beta}(1,1)$ prior. The violin width at $y$ is proportional to the posterior PDF, and the horizontal tick marks the posterior mean. This makes three things easy to read off: (i) the plot is bounded to $[0,1]$ and skewed near the edges, so near-saturated and near-zero cells are rendered honestly; (ii) posterior *overlap* between two methods is immediate, a more faithful proxy for significance than non-overlapping error bars; and (iii) high-variance cells (flatter violins) are visually distinguishable from confidently-estimated ones (tight violins).
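The posterior summary described above needs only the counts $k$ and $n$. The sketch below computes the posterior mean in closed form and an equal-tailed credible interval by Monte Carlo sampling with NumPy; the function name `beta_posterior_summary`, the 95% level, and the example counts are illustrative assumptions, not values from the paper.

```python
import numpy as np


def beta_posterior_summary(k, n, level=0.95, n_samples=200_000, seed=0):
    """Summarize the Beta(k+1, n-k+1) posterior of a success rate under a uniform prior.

    Returns (posterior mean, lower quantile, upper quantile).
    """
    a, b = k + 1, n - k + 1
    mean = a / (a + b)  # closed-form posterior mean (k + 1) / (n + 2)
    rng = np.random.default_rng(seed)
    draws = rng.beta(a, b, size=n_samples)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return mean, float(lo), float(hi)


if __name__ == "__main__":
    # Illustrative counts only: a near-saturated cell and a near-zero cell.
    for k in (594, 8):
        mean, lo, hi = beta_posterior_summary(k, n=600)
        print(f"k={k}: mean={mean:.4f}, 95% interval=[{lo:.4f}, {hi:.4f}]")
```

Unlike a symmetric error bar, the interval stays inside $[0,1]$ and becomes asymmetric around the mean as the success rate approaches either boundary.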

One caveat: pooling all $600$ rollouts into a single Beta posterior treats them as i.i.d. Bernoulli draws from a common success probability, folding the three seeds’ (different) policies into a single rate. This captures within-policy sampling uncertainty but absorbs across-seed (policy-level) variance into the same Bernoulli noise, so the posterior should be read as an estimate of the *pooled* success rate; the seed-SE bars in the main paper remain the appropriate reference for method-level uncertainty.

[Figure 8](https://arxiv.org/html/2605.13959#A1.F8) shows this analysis for the main setting ($H=8$, NFE $=1$), and [Figure 9](https://arxiv.org/html/2605.13959#A1.F9) shows it when chunking is disabled ($H=1$, NFE $=1$).

![Refer to caption](https://arxiv.org/html/2605.13959v1/x8.png) Figure 8: Main results ($H=8$, NFE $=1$): Beta-posterior violins. Same data as the NFE $=1$ column of [Table 1](https://arxiv.org/html/2605.13959#S3.T1); the violin width is proportional to the $\mathrm{Beta}(k+1,n-k+1)$ posterior PDF over the success rate, with $n=600$ rollouts per cell.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x9.png) Figure 9: Action-chunk length $H=1$ results (NFE $=1$): Beta-posterior violins. Same data as [Figure 6](https://arxiv.org/html/2605.13959#S5.F6); visualization is identical in style to [Figure 8](https://arxiv.org/html/2605.13959#A1.F8).

## Appendix B Why WarmPrior Straightens Flows: A Branching\-Cost Analysis

This appendix develops the theory underlying WarmPrior. We first formalize how endpoint ambiguity in the source-target coupling induces a branching cost in the learned flow ([Section B.1](https://arxiv.org/html/2605.13959#A2.SS1)), and then derive a bound showing how WarmPrior provably reduces this cost ([Section B.2](https://arxiv.org/html/2605.13959#A2.SS2)). For readability, we vectorize an action chunk into a single vector in $\mathbb{R}^{d}$, where $d=Hd_{a}$, and condition throughout on the policy input $o$. We write $(A_{0},A_{1})\sim\Pi_{o}$ for the conditional joint law induced by the training procedure, where $A_{0}$ is the source sample and $A_{1}$ is the target action chunk. Under the linear interpolant,

$$A_{t}=(1-t)A_{0}+tA_{1},\qquad t\in[0,1]. \tag{9}$$

#### What the flow\-matching loss is regressing\.

For the linear interpolant, the target velocity is

$$\dot{A}_{t}=A_{1}-A_{0}.$$

Hence the population flow-matching objective for a velocity field $v_{t}(\cdot,o)$ is

$$\mathcal{L}_{o}(v)\;\coloneqq\;\int_{0}^{1}\mathbb{E}\!\left[\|v_{t}(A_{t},o)-(A_{1}-A_{0})\|_{2}^{2}\,\middle|\,o\right]dt. \tag{10}$$

This is simply an $L^{2}$ regression problem: from the observable pair $(A_{t},o)$, the network tries to predict the transport direction $A_{1}-A_{0}$.

### B\.1 The branching cost as endpoint ambiguity

The next theorem states that the irreducible error of this regression problem is exactly the conditional variance of the endpoint $A_{1}$ given the intermediate point $A_{t}$, weighted by $(1-t)^{-2}$ and integrated over time. This is the sense in which path intersection or branching creates curvature: if many distinct endpoints are compatible with the same intermediate point, the vector field must average over them.

###### Theorem B\.1 \(Exact formula for the branching cost\)\.

For each $t\in[0,1)$, the unique minimizer of $\mathcal{L}_{o}(v)$ is

$$v_{t}^{\star}(x,o)=\mathbb{E}[A_{1}-A_{0}\mid A_{t}=x,o]=\frac{\mathbb{E}[A_{1}\mid A_{t}=x,o]-x}{1-t}. \tag{11}$$

Define

$$\mathcal{B}(o)\;\coloneqq\;\mathcal{L}_{o}(v^{\star}). \tag{12}$$

Then

$$\mathcal{B}(o)=\int_{0}^{1}\frac{1}{(1-t)^{2}}\,\mathbb{E}\!\left[\|A_{1}-\mathbb{E}[A_{1}\mid A_{t},o]\|_{2}^{2}\,\middle|\,o\right]dt. \tag{13}$$

#### Proof sketch\.

The only information available to the predictor is $(A_{t},o)$, so the best possible $L^{2}$ predictor of the target velocity $A_{1}-A_{0}$ is its conditional expectation given $(A_{t},o)$. The minimum mean-squared error is therefore the conditional variance of that target velocity. For the linear interpolant, $A_{1}-A_{0}=(A_{1}-A_{t})/(1-t)$, so this conditional variance can be rewritten directly in terms of the ambiguity of the endpoint $A_{1}$ after observing the intermediate point $A_{t}$.

###### Proof\.

Fix $t<1$, and define

$$\Delta\;\coloneqq\;A_{1}-A_{0}.$$

The integrand of ([10](https://arxiv.org/html/2605.13959#A2.E10)) is an $L^{2}$ regression problem: among all $(A_{t},o)$-measurable square-integrable random variables, the unique minimizer of

$$g\;\mapsto\;\mathbb{E}[\|g-\Delta\|_{2}^{2}\mid o]$$

is the orthogonal projection of $\Delta$ onto the sigma-field generated by $(A_{t},o)$, namely

$$g^{\star}=\mathbb{E}[\Delta\mid A_{t},o].$$

Therefore

$$v_{t}^{\star}(A_{t},o)=\mathbb{E}[A_{1}-A_{0}\mid A_{t},o].$$

This proves the first equality in ([11](https://arxiv.org/html/2605.13959#A2.E11)).

To prove the second equality, observe from ([9](https://arxiv.org/html/2605.13959#A2.E9)) that

$$A_{t}=(1-t)A_{0}+tA_{1}\;\Longrightarrow\;A_{1}-A_{0}=\frac{A_{1}-A_{t}}{1-t}.$$

Taking the conditional expectation given $A_{t}=x$ and $o$ yields

$$v_{t}^{\star}(x,o)=\mathbb{E}[A_{1}-A_{0}\mid A_{t}=x,o]=\frac{\mathbb{E}[A_{1}\mid A_{t}=x,o]-x}{1-t},$$

which proves ([11](https://arxiv.org/html/2605.13959#A2.E11)).

We now compute the minimum value of the risk. By the standard projection identity for conditional expectation,

$$\inf_{g}\mathbb{E}[\|g-\Delta\|_{2}^{2}\mid o]=\mathbb{E}[\|\Delta-\mathbb{E}[\Delta\mid A_{t},o]\|_{2}^{2}\mid o].$$

Hence

$$\mathcal{B}(o)=\int_{0}^{1}\mathbb{E}\!\left[\|\Delta-\mathbb{E}[\Delta\mid A_{t},o]\|_{2}^{2}\,\middle|\,o\right]dt.$$

Using again $\Delta=(A_{1}-A_{t})/(1-t)$, we obtain

$$\Delta-\mathbb{E}[\Delta\mid A_{t},o]=\frac{A_{1}-\mathbb{E}[A_{1}\mid A_{t},o]}{1-t}.$$

Substituting this into the previous display gives ([13](https://arxiv.org/html/2605.13959#A2.E13)). ∎

#### Interpretation\.

The quantity $\mathcal{B}(o)$ is the irreducible flow-matching error under the coupling $\Pi_o$. Equation ([13](https://arxiv.org/html/2605.13959#A2.E13)) shows that it is exactly the time-integrated ambiguity of the endpoint $A_1$ after observing the intermediate point $A_t$. If $A_t$ almost surely determines $A_1$, then $\mathcal{B}(o) = 0$, meaning there is no branching ambiguity for the vector field to average over. This is the ideal non-branching situation realized by OT-style Monge couplings.
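To make the notion of endpoint ambiguity concrete, the following minimal sketch evaluates $\mathbb{E}[(A_1 - \mathbb{E}[A_1 \mid A_t])^2]$ at a single time $t$ for two hand-made scalar couplings. The function name `endpoint_ambiguity` and both toy couplings are illustrative assumptions, not part of the paper. Because the couplings are finite, paths cross at one isolated time only, so the snippet illustrates the unweighted integrand of Eq. (13) rather than the integral $\mathcal{B}(o)$ itself.

```python
from collections import defaultdict


def endpoint_ambiguity(coupling, t):
    """E[(A1 - E[A1 | A_t])^2] for a finite scalar coupling.

    coupling: list of (a0, a1, prob) triples (hypothetical toy data).
    """
    groups = defaultdict(list)
    for a0, a1, p in coupling:
        a_t = round((1.0 - t) * a0 + t * a1, 12)  # linear interpolant
        groups[a_t].append((a1, p))
    total = 0.0
    for members in groups.values():
        mass = sum(p for _, p in members)
        mean = sum(a1 * p for a1, p in members) / mass  # E[A1 | A_t]
        total += sum(p * (a1 - mean) ** 2 for a1, p in members)
    return total


# Independent coupling: every source point is paired with every endpoint.
independent = [(-1, -1, 0.25), (-1, 1, 0.25), (1, -1, 0.25), (1, 1, 0.25)]
# Monotone coupling: each source point is paired with a single endpoint.
monotone = [(-1, -1, 0.5), (1, 1, 0.5)]

print(endpoint_ambiguity(independent, 0.5))  # two paths meet at A_t = 0
print(endpoint_ambiguity(monotone, 0.5))     # paths never meet
```

Under the independent coupling the two crossing paths share the intermediate point $A_t = 0$ at $t = 0.5$, so the endpoint cannot be recovered from it; under the monotone coupling every intermediate point identifies its endpoint.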

### B.2 A detailed derivation of the WarmPrior bound

We now prove the bound used in the main text. Let $P_{\mathcal{W}}$ denote the orthogonal projection onto the warm coordinates and let $P_{\mathcal{C}} = I - P_{\mathcal{W}}$ be the projection onto the cold coordinates. We write

$$d_{\mathcal{W}} \coloneqq \operatorname{tr}(P_{\mathcal{W}}),$$

so that $d_{\mathcal{W}}$ is exactly the number of warm scalar coordinates.

Under WarmPrior, the source sample takes the form

$$A_0 = P_{\mathcal{W}}(\mu + \sigma\Xi) + P_{\mathcal{C}}\Xi, \quad \Xi \sim \mathcal{N}(0, I_d), \tag{14}$$

where $\Xi$ is conditionally independent of $(A_1, \mu)$ given $o$. The first term says that on the warm coordinates we start near a structured mean $\mu$, while on the cold coordinates we keep the vanilla Gaussian prior.
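The source construction in Eq. (14) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `sample_warm_prior` is hypothetical, and $P_{\mathcal{W}}$ is taken to be an axis-aligned projection represented by a boolean mask, which the text does not require in general.

```python
import numpy as np


def sample_warm_prior(mu, sigma, warm_mask, rng):
    """Draw A0 = P_W (mu + sigma * Xi) + P_C Xi with Xi ~ N(0, I).

    warm_mask is a boolean vector; P_W keeps the True entries
    (an axis-aligned projection, which is an assumption of this sketch).
    """
    mu = np.asarray(mu, dtype=float)
    warm_mask = np.asarray(warm_mask, dtype=bool)
    xi = rng.standard_normal(mu.shape)
    # Warm coordinates: structured mean plus scaled noise.
    # Cold coordinates: plain standard Gaussian noise.
    return np.where(warm_mask, mu + sigma * xi, xi)


rng = np.random.default_rng(0)
mu = np.array([0.2, -0.1, 0.4, 0.0])            # toy prior mean
warm_mask = np.array([True, True, False, False])
a0 = sample_warm_prior(mu, sigma=0.5, warm_mask=warm_mask, rng=rng)
print(a0)
```

Setting `sigma=0.0` pins the warm coordinates to `mu` exactly while the cold coordinates remain Gaussian.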

To isolate the effect of the warm part, define the warm-coordinate branching cost by

$$\mathcal{B}_{\mathcal{W}}(o) \coloneqq \int_0^1 \frac{1}{(1-t)^2}\, \mathbb{E}\!\left[\|P_{\mathcal{W}}A_1 - \mathbb{E}[P_{\mathcal{W}}A_1 \mid A_t, o]\|_2^2 \,\middle|\, o\right] dt. \tag{15}$$

###### Proposition B.2 (WarmPrior upper bound on the warm-coordinate branching cost).

Under ([14](https://arxiv.org/html/2605.13959#A2.E14)),

$$\mathcal{B}_{\mathcal{W}}(o) \leq \mathbb{E}\!\left[\|P_{\mathcal{W}}(A_1 - \mu)\|_2^2 \,\middle|\, o\right] + \sigma^2 d_{\mathcal{W}}. \tag{16}$$

#### Proof sketch\.

The proof has one key idea. The conditional expectation $\mathbb{E}[P_{\mathcal{W}}A_1 \mid A_t, o]$ is the *best* predictor of the warm endpoint from $(A_t, o)$, so we may upper-bound its error by evaluating the same error at any simpler predictor. We choose the very simple predictor $P_{\mathcal{W}}A_t$, because it is directly observable from the interpolated sample and because, under the linear interpolant, the difference $P_{\mathcal{W}}A_1 - P_{\mathcal{W}}A_t$ contains an explicit factor of $(1-t)$ whose square exactly cancels the prefactor $1/(1-t)^2$ in ([15](https://arxiv.org/html/2605.13959#A2.E15)). After this cancellation, the remaining expression splits into a mean-mismatch term and a Gaussian-noise term.

###### Proof\.

We proceed in three explicit steps\.

#### Step 1: replace the optimal predictor with a tractable surrogate\.

For any square-integrable random variables $Y$ and $X$, the conditional expectation $\mathbb{E}[Y \mid X]$ is the unique minimizer of

$$g \mapsto \mathbb{E}[\|Y - g(X)\|_2^2].$$

Equivalently,

$$\mathbb{E}\!\left[\|Y - \mathbb{E}[Y \mid X]\|_2^2\right] \leq \mathbb{E}\!\left[\|Y - g(X)\|_2^2\right] \quad \text{for every measurable } g. \tag{17}$$

We apply this with

$$Y = P_{\mathcal{W}}A_1, \quad X = (A_t, o), \quad g(A_t, o) = P_{\mathcal{W}}A_t.$$

This choice is valid because $P_{\mathcal{W}}A_t$ is clearly measurable with respect to $(A_t, o)$. Using ([17](https://arxiv.org/html/2605.13959#A2.E17)) inside ([15](https://arxiv.org/html/2605.13959#A2.E15)) gives

$$\mathcal{B}_{\mathcal{W}}(o) \leq \int_0^1 \frac{1}{(1-t)^2}\, \mathbb{E}\!\left[\|P_{\mathcal{W}}A_1 - P_{\mathcal{W}}A_t\|_2^2 \,\middle|\, o\right] dt. \tag{18}$$

#### Step 2: express the bound in the WarmPrior parameters $(\mu, \sigma)$.

From $A_t = (1-t)A_0 + tA_1$, we have

$$P_{\mathcal{W}}A_t = (1-t)P_{\mathcal{W}}A_0 + tP_{\mathcal{W}}A_1.$$

Hence

$$P_{\mathcal{W}}A_1 - P_{\mathcal{W}}A_t = (1-t)\bigl(P_{\mathcal{W}}A_1 - P_{\mathcal{W}}A_0\bigr). \tag{19}$$

Now substitute the WarmPrior form ([14](https://arxiv.org/html/2605.13959#A2.E14)): on the warm coordinates,

$$P_{\mathcal{W}}A_0 = P_{\mathcal{W}}(\mu + \sigma\Xi).$$

Therefore

$$P_{\mathcal{W}}A_1 - P_{\mathcal{W}}A_0 = P_{\mathcal{W}}(A_1 - \mu) - \sigma P_{\mathcal{W}}\Xi. \tag{20}$$

Combining ([19](https://arxiv.org/html/2605.13959#A2.E19)) and ([20](https://arxiv.org/html/2605.13959#A2.E20)),

$$P_{\mathcal{W}}A_1 - P_{\mathcal{W}}A_t = (1-t)\Bigl(P_{\mathcal{W}}(A_1 - \mu) - \sigma P_{\mathcal{W}}\Xi\Bigr). \tag{21}$$

Squaring ([21](https://arxiv.org/html/2605.13959#A2.E21)) produces a $(1-t)^2$ factor that cancels the $1/(1-t)^2$ prefactor in ([18](https://arxiv.org/html/2605.13959#A2.E18)), leaving an integrand independent of $t$. Substituting and integrating over $t \in [0,1]$ yields

$$\mathcal{B}_{\mathcal{W}}(o) \leq \mathbb{E}\!\left[\bigl\|P_{\mathcal{W}}(A_1 - \mu) - \sigma P_{\mathcal{W}}\Xi\bigr\|_2^2 \,\middle|\, o\right]. \tag{22}$$

#### Step 3: decompose into mismatch and noise\.

Expand the squared norm:

$$\bigl\|P_{\mathcal{W}}(A_1 - \mu) - \sigma P_{\mathcal{W}}\Xi\bigr\|_2^2 = \|P_{\mathcal{W}}(A_1 - \mu)\|_2^2 + \sigma^2\|P_{\mathcal{W}}\Xi\|_2^2 - 2\sigma\bigl\langle P_{\mathcal{W}}(A_1 - \mu),\, P_{\mathcal{W}}\Xi\bigr\rangle. \tag{23}$$

Taking the conditional expectation given $o$, the cross term vanishes. Indeed, by assumption, $\Xi$ is conditionally independent of $(A_1, \mu)$ given $o$, and has zero mean, so

$$\mathbb{E}\!\left[\bigl\langle P_{\mathcal{W}}(A_1 - \mu),\, P_{\mathcal{W}}\Xi\bigr\rangle \,\middle|\, o\right] = 0.$$

Therefore

$$\mathbb{E}\!\left[\bigl\|P_{\mathcal{W}}(A_1 - \mu) - \sigma P_{\mathcal{W}}\Xi\bigr\|_2^2 \,\middle|\, o\right] = \mathbb{E}\!\left[\|P_{\mathcal{W}}(A_1 - \mu)\|_2^2 \,\middle|\, o\right] + \sigma^2\, \mathbb{E}\!\left[\|P_{\mathcal{W}}\Xi\|_2^2\right]. \tag{24}$$

Since $P_{\mathcal{W}}$ is the orthogonal projection onto a $d_{\mathcal{W}}$-dimensional subspace and $\Xi \sim \mathcal{N}(0, I_d)$,

$$\mathbb{E}[\|P_{\mathcal{W}}\Xi\|_2^2] = d_{\mathcal{W}}.$$

Substituting into ([24](https://arxiv.org/html/2605.13959#A2.E24)) and then into ([22](https://arxiv.org/html/2605.13959#A2.E22)) yields

$$\mathcal{B}_{\mathcal{W}}(o) \leq \mathbb{E}\!\left[\|P_{\mathcal{W}}(A_1 - \mu)\|_2^2 \,\middle|\, o\right] + \sigma^2 d_{\mathcal{W}},$$

which is exactly ([16](https://arxiv.org/html/2605.13959#A2.E16)). ∎
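Steps 2 and 3 can be checked numerically. The sketch below is a Monte Carlo illustration on synthetic data, not the authors' code: the dimensions, the uniform toy targets, the $0.3$ mismatch scale, and the mask-based $P_{\mathcal{W}}$ are all assumptions chosen for the example. It compares the surrogate error on the right-hand side of Eq. (18) at one fixed $t$ with the mismatch-plus-noise expression in Eq. (16).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_w = 200_000, 6, 4        # samples, total dim, warm dim (toy values)
sigma, t = 0.5, 0.7

warm = np.zeros(d, dtype=bool)
warm[:d_w] = True                # P_W as an axis-aligned mask (assumption)

a1 = rng.uniform(-1.0, 1.0, size=(n, d))        # toy targets A1
mu = a1 + 0.3 * rng.standard_normal((n, d))     # imperfect prior mean
xi = rng.standard_normal((n, d))                # drawn independently of (a1, mu)

a0 = np.where(warm, mu + sigma * xi, xi)        # source sample, Eq. (14)
a_t = (1.0 - t) * a0 + t * a1                   # linear interpolant

# Integrand on the right-hand side of Eq. (18), including the 1/(1-t)^2 weight.
surrogate = np.mean(np.sum((a1 - a_t)[:, warm] ** 2, axis=1)) / (1.0 - t) ** 2

# Right-hand side of Eq. (16): mean-mismatch term plus Gaussian-noise term.
mismatch = np.mean(np.sum((a1 - mu)[:, warm] ** 2, axis=1))
bound = mismatch + sigma**2 * d_w

print(surrogate, bound)  # should agree up to Monte Carlo error
```

Changing `t` leaves the agreement intact, which mirrors the observation that the integrand becomes independent of $t$ after the $(1-t)^2$ cancellation.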

#### Interpretation\.

Proposition [B.2](https://arxiv.org/html/2605.13959#A2.Thmtheorem2) shows that, on the warm coordinates, the branching cost is controlled by only two quantities: *(i)* the mismatch between the WarmPrior mean $\mu$ and the target $A_1$, and *(ii)* the residual Gaussian noise level $\sigma$. Thus, on the warm coordinates, WarmPrior becomes straighter when its mean is closer to the target and when its residual noise is smaller. This explains the ordering of our variants. For Preview, the training construction makes the warm mean target-aligned, so the mismatch term vanishes and only the $\sigma^2 d_{\mathcal{W}}$ term remains. For Past, the mean is only an approximation to the current target chunk, so an additional residual mismatch term remains. The vanilla Gaussian baseline corresponds to a source mean that is far less aligned with the target, and therefore incurs a much larger ambiguity term.
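As a sanity check on the bound, consider a scalar toy case that is not taken from the paper: $d = d_{\mathcal{W}} = 1$, a standard normal target $A_1 \sim \mathcal{N}(0,1)$, and the vanilla Gaussian source, i.e. $\mu = 0$ and $\sigma = 1$, with $\Xi$ independent of $A_1$. Everything is jointly Gaussian, so the branching cost can be computed in closed form and compared with the right-hand side of Eq. (16).

```latex
% Toy assumptions: d = d_W = 1, A_1 ~ N(0,1), mu = 0, sigma = 1, Xi independent of A_1.
\begin{aligned}
A_t &= (1-t)\,\Xi + t\,A_1,
\qquad \operatorname{Cov}(A_1, A_t) = t,
\qquad \operatorname{Var}(A_t) = (1-t)^2 + t^2, \\
\operatorname{Var}(A_1 \mid A_t)
  &= 1 - \frac{t^2}{(1-t)^2 + t^2}
   = \frac{(1-t)^2}{(1-t)^2 + t^2}, \\
\mathcal{B}_{\mathcal{W}}
  &= \int_0^1 \frac{1}{(1-t)^2}\,\operatorname{Var}(A_1 \mid A_t)\,dt
   = \int_0^1 \frac{dt}{(1-t)^2 + t^2}
   = \int_0^1 \frac{2\,dt}{1 + (2t-1)^2}
   = \Bigl[\arctan(2t-1)\Bigr]_0^1
   = \frac{\pi}{2}, \\
\text{bound in (16)}
  &= \mathbb{E}[A_1^2] + \sigma^2 d_{\mathcal{W}} = 1 + 1 = 2 .
\end{aligned}
```

In this toy case the exact cost $\pi/2 \approx 1.571$ sits below the bound of $2$, as Proposition B.2 requires, and the gap shows that the surrogate predictor $P_{\mathcal{W}}A_t$ is not optimal.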

## Appendix C Training Details

[Table 3](https://arxiv.org/html/2605.13959#A3.T3) lists all training hyperparameters used in this work. Robomimic and MimicGen experiments share the same Diffusion Policy (ChiTransformer) Chi et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib8)) backbone, which combines a Transformer Vaswani et al. ([2017](https://arxiv.org/html/2605.13959#bib.bib32)) trunk with a ResNet-18 He et al. ([2016](https://arxiv.org/html/2605.13959#bib.bib30)) image encoder, GroupNorm Wu and He ([2018](https://arxiv.org/html/2605.13959#bib.bib31)) normalization, and AdamW Loshchilov and Hutter ([2019](https://arxiv.org/html/2605.13959#bib.bib29)) optimization; the two settings differ only in batch size and iteration count. For the real-robot experiments we fine-tune GR00T N1.5-3B Bjorck et al. ([2025a](https://arxiv.org/html/2605.13959#bib.bib15),[b](https://arxiv.org/html/2605.13959#bib.bib40)), whose vision tower uses SigLIP-So400m Zhai et al. ([2023](https://arxiv.org/html/2605.13959#bib.bib34)), language backbone uses Qwen3-1.7B Qwen Team ([2025](https://arxiv.org/html/2605.13959#bib.bib41)) embedded in the Eagle 2.5-VL stack Chen et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib42)), and action head uses a DiT Peebles and Xie ([2023](https://arxiv.org/html/2605.13959#bib.bib33)) module; we keep the LLM and vision tower frozen and update only the action-head projector and the DiT module.

Table 3: Training hyperparameters across all experiments. Robomimic and MimicGen use the ChiTransformer flow-matching backbone; real-robot experiments fine-tune GR00T N1.5-3B with the LLM and vision tower frozen. "—" marks rows that do not apply to a given setup.

| Hyperparameter | Robomimic (state / image) | MimicGen (image) | GR00T N1.5 (real Franka) |
| --- | --- | --- | --- |
| *Architecture* | | | |
| Backbone | ChiTransformer | ChiTransformer | GR00T N1.5-3B |
| Embedding dim | 384 | 384 | — |
| Transformer layers | 8 | 8 | — |
| Attention heads | 6 | 6 | — |
| Timestep emb. dim | 128 | 128 | — |
| Attention dropout | 0.1 | 0.1 | — |
| Image encoder | ResNet-18 (ImageNet) | ResNet-18 (ImageNet) | SigLIP-So400m (frozen) |
| LLM | — | — | Qwen3-1.7B (frozen) |
| VLM backbone | — | — | Eagle 2.5-VL |
| Image input size | $84 \times 84$ | $84 \times 84$ | $224 \times 224$ |
| RGB cameras (image) | 2 (1 TPV + 1 wrist); Transport: 2 $\times$ (1 TPV + 1 wrist) | 2 (1 TPV + 1 wrist) | 3 (2 TPV + 1 wrist) |
| Image augmentation | $76 \times 76$ random crop, GroupNorm | $76 \times 76$ random crop, GroupNorm | 0.95-scale random crop, resize to $224 \times 224$, color jitter |
| State/action normalization | per-key min–max | per-key min–max | per-key min–max |
| Tuned components | all | all | action-head projector + DiT |
| *Optimization* | | | |
| Optimizer | AdamW | AdamW | AdamW |
| Learning rate | $1 \times 10^{-4}$ | $1 \times 10^{-4}$ | $1 \times 10^{-4}$ |
| Weight decay | $1 \times 10^{-5}$ | $1 \times 10^{-5}$ | $1 \times 10^{-5}$ |
| Adam $(\beta_1, \beta_2)$ | $(0.9, 0.999)$ | $(0.9, 0.999)$ | $(0.95, 0.999)$ |
| Adam $\epsilon$ | $1 \times 10^{-8}$ | $1 \times 10^{-8}$ | $1 \times 10^{-8}$ |
| LR schedule | warmup + cosine | warmup + cosine | warmup + cosine |
| Warmup ratio | 0.20 | 0.20 | 0.05 |
| Gradient accumulation | 1 | 1 | 1 |
| Gradient checkpointing | no | no | no |
| Mixed precision | FP32 | FP32 | bf16 + tf32 |
| EMA rate | 0.995 | 0.995 | — |
| Batch size | 1024 / 256 | 128 | 32 |
| Iterations | 200,000 | 50,000 | 20,000 |
| Training seeds | 3 | 3 | 3 |
| *Policy and data* | | | |
| Action-chunk length $H$ | 8 | 8 | 16 |
| Action dim | 7–14 (per task) | 7–14 (per task) | 7 |
| Observation steps | 2 | 2 | 1 |
| State dim | 9–53 (per task) | 9–53 (per task) | 7 |
| Demonstrations per task | 250 (PH) / 300 (MH) | 10 | 30 |
| Interpolant | linear | linear | linear |
| Loss | flow matching | flow matching | flow matching |
| WP-Past noise scale $\sigma$ | 0.5 | 0.5 | 0.5 |
| WP-Preview noise scale $\sigma$ | 1.0 | 1.0 | 1.0 |
| *Evaluation* | | | |
| Inference NFE | $\{1, 3, 9\}$ | $\{1, 3, 9\}$ | 4 |
| Episodes per (task, seed) | 200 | 200 | 50 |
| Top-$K$ checkpoint averaging | $K = 3$ | $K = 3$ | $K = 1$ |
| Parallel envs | 20 | 20 | — (real) |
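The "warmup + cosine" schedule row can be read as follows. This is only a sketch: the table gives the peak learning rate, warmup ratio, and iteration count, but the exact schedule shape is not specified there, so linear warmup from zero and cosine decay to zero are assumptions, and the function name `warmup_cosine_lr` is hypothetical.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.20):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Robomimic column of Table 3: 200,000 iterations, warmup ratio 0.20.
for step in (0, 20_000, 40_000, 120_000, 200_000):
    print(step, warmup_cosine_lr(step, total_steps=200_000))
```

With these values the warmup phase covers the first 40,000 iterations; for the GR00T column one would pass `total_steps=20_000` and `warmup_ratio=0.05`.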

## Appendix D $\sigma$ Ablation

The bound in [Equation 7](https://arxiv.org/html/2605.13959#S5.E7) predicts a non-monotone dependence on the prior std $\sigma$: too large and the irreducible $\sigma^2 d_{\mathcal{W}}$ term dominates, making the field bend to absorb a wide source; too small and the source concentrates onto the imperfect prior mean $\mu$ with no slack to absorb the persistence residual (WP-Past) or the forecast error (WP-Preview). We empirically validate this trade-off on the most multimodal Robomimic task, Square-MH, by sweeping $\sigma \in \{1.5, 1.0, 0.5, 0.3, 0.1, 0.05, 0\}$ and evaluating each configuration with three seeds at $\text{NFE} = 1$ and $H = 8$. [Figure 10](https://arxiv.org/html/2605.13959#A4.F10) reports the resulting success rate and seed standard deviation.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x10.png)
(a) WP-Preview: peak at $\sigma = 1.0$.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x11.png)
(b) WP-Past: peak at $\sigma = 0.5$.

Figure 10: $\sigma$ ablation on Square-MH ($\text{NFE} = 1$, $H = 8$, three seeds). Shaded band is $\pm 1$ seed std. The right end ($\sigma = 0$) is the regression limit. The persistence prior of WP-Past carries more residual error than the WP-Preview forecast, so it benefits from a tighter source ($\sigma = 0.5$ vs $\sigma = 1.0$).

#### Findings\.

WP-Preview peaks at $\sigma = 1.0$ with $\mathrm{SR} = 0.778$, while WP-Past peaks at the smaller $\sigma = 0.5$ with $\mathrm{SR} = 0.701$. This ordering is consistent with the role of $\mu$ in [Equation 7](https://arxiv.org/html/2605.13959#S5.E7): Past carries the persistence residual $R$ in its mean-mismatch term, so concentrating the source onto $\mu$ via small $\sigma$ is more costly for Past than for Preview, and Past's optimum is pushed toward a smaller $\sigma$ where the $\sigma^2 d_{\mathcal{W}}$ penalty is reduced enough to compensate. Preview's smaller forecast error $E$ leaves the mean-mismatch term less sensitive to concentration, so its optimum sits where coverage of the multimodal target dominates the trade-off ($\sigma = 1.0$). Both curves exhibit a broad plateau followed by a sharp collapse: WP-Preview stays within $0.06$ of its optimum across $\sigma \in [0.3, 1.5]$, and WP-Past stays within $0.08$ of its optimum across $\sigma \in [0.3, 1.0]$, after which performance falls steeply for $\sigma \leq 0.1$. The plateau makes the choice of $\sigma$ forgiving in the moderate-noise regime.

#### Fixed $\sigma$ across tasks.

Based on this ablation we fix $\sigma = 1.0$ for WP-Preview and $\sigma = 0.5$ for WP-Past for *all* Robomimic, MimicGen, and real-robot experiments reported in the main paper. *We did not tune $\sigma$ per task.* The plateau in [Figure 10](https://arxiv.org/html/2605.13959#A4.F10) indicates that the method is robust to the choice of $\sigma$ in the moderate-noise regime, and the consistent gains obtained with these fixed values across eight benchmark tasks and a real-robot suite in [Table 1](https://arxiv.org/html/2605.13959#S3.T1) confirm that a single setting transfers cleanly across embodiments and task difficulties without per-task tuning.

#### The $\sigma \to 0$ limit.

The right end of both curves ($\sigma = 0$) corresponds to a deterministic source $a_0 = \mu$, at which point the policy reduces to a regression-style mapping $\mu \mapsto a_1$ rather than a stochastic generative sampler. This is the regime explored by A2A Jia et al. ([2026](https://arxiv.org/html/2605.13959#bib.bib22)), which encodes the action history into a deterministic latent source and composes a deterministic ODE on top. Such a deterministic prior accelerates training convergence because the source is no longer randomized, but it also collapses the conditional $p(a \mid o)$ to a single mode, giving up the multimodal coverage that motivates generative imitation in the first place. [Figure 10](https://arxiv.org/html/2605.13959#A4.F10) makes this concrete: the $\sigma = 0$ end-point drops to $\mathrm{SR} \approx 0.31$ for Preview and to $\mathrm{SR} \approx 0.05$ for Past, an essentially complete failure. The Past collapse is the more severe of the two because its prior mean is the previously executed chunk: without injected noise the policy is asked to map the past chunk directly to the next chunk through a network that never saw such a deterministic source–target pairing during training. The full WarmPrior with $\sigma > 0$ retains the generative structure while still exploiting the temporally grounded prior, and our ablation shows that this stochastic regime is where the success rate is maximized.

## Appendix E Comparing WarmPrior with Real-Time Chunking

WarmPrior and Real-Time Chunking (RTC) Black et al. ([2025b](https://arxiv.org/html/2605.13959#bib.bib24)) both exploit the fact that the previously executed action chunk carries a great deal of information about the next one, yet they intervene at different points in the policy stack. RTC is an inference-time procedure: at each new decision step the policy regenerates the next chunk while clamping its early positions to the actions still being executed, so the freshly generated chunk is forced to commit to the same mode as the one it overlaps with. WarmPrior is a training-time prior shaping mechanism ([Section 3](https://arxiv.org/html/2605.13959#S3)): it anchors the source distribution on the previous chunk and trains the velocity field under that anchored coupling, so the learned flow itself is shorter and straighter without any inference-time inpainting. Because both mechanisms read from the same "past chunk" signal, it is natural to ask whether they are merely two encodings of the same gain. This section disentangles the two.

#### Setup\.

This experiment uses a different backbone from [Section 4.2](https://arxiv.org/html/2605.13959#S4.SS2): we evaluate on the $\pi_{0.5}$ vision-language-action model Physical Intelligence et al. ([2025](https://arxiv.org/html/2605.13959#bib.bib23)), whose flow-matching action head is the natural backbone to test alongside RTC. The remainder of the real-robot pipeline matches [Section 4.2](https://arxiv.org/html/2605.13959#S4.SS2): a Franka Research 3 with teleoperated demonstrations collected on the DROID platform Khazatsky et al. ([2024](https://arxiv.org/html/2605.13959#bib.bib39)), three training seeds, and 20 evaluation trials per seed. We deliberately pick two tasks where RTC is known to help, namely the dynamic and precision-sensitive *Block Throwing* and *Towel Folding* ([Figure 12](https://arxiv.org/html/2605.13959#A5.F12)), and ask whether WarmPrior also gains in this regime.

We compare four configurations of the same flow-matching backbone: Base (vanilla $\mathcal{N}(0, I)$ prior, independent per-chunk inference), RTC (the Base policy executed under the real-time chunking inference procedure), WarmPrior (WP-Preview with the temporally grounded prior, independent per-chunk inference), and RTC+WarmPrior (the WP-Preview policy executed under the same RTC procedure). The combined configuration is the natural "stack" of the two interventions: WP-Preview reshapes $p_0$ at training time, and RTC additionally clamps the executing portion of the trajectory at inference time.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x12.png)

Figure 11: RTC comparison tasks. The two highly dynamic real-robot scenes used in [Figure 12](https://arxiv.org/html/2605.13959#A5.F12): *Block Throwing* and *Towel Folding*. Both involve fast, committed whole-arm motions where mode-switching across chunk boundaries is particularly visible.

![Refer to caption](https://arxiv.org/html/2605.13959v1/x13.png)

Figure 12: RTC vs. WarmPrior on highly dynamic tasks. Real-robot success rate of $\pi_{0.5}$ (mean and seed standard deviation over three training seeds, 20 trials per seed). RTC and WarmPrior each improve over the baseline, and the combination exceeds both, suggesting their gains come from distinct mechanisms.

#### Findings\.

[Figure 12](https://arxiv.org/html/2605.13959#A5.F12) reports per-task success rate, and three observations follow.

*(i) RTC is effective on highly dynamic tasks.* RTC nearly doubles the baseline on *Block Throwing* ($0.32 \to 0.57$) and lifts *Towel Folding* from $0.50$ to $0.67$. This is consistent with the picture in which mode-switching across chunk boundaries is most damaging when the underlying motion is fast and committed, exactly the regime where RTC's explicit inpainting suppresses inter-chunk discontinuities.

*(ii) WarmPrior also provides consistent gains.* WarmPrior alone improves both tasks ($0.32 \to 0.48$ on *Block Throwing*, $0.50 \to 0.68$ on *Towel Folding*), and on *Towel Folding* its gain is comparable to that of RTC ($0.68$ vs $0.67$). This is the picture predicted by [Section 5.1](https://arxiv.org/html/2605.13959#S5.SS1): the prior mean drawn from the previous chunk reduces endpoint ambiguity for the velocity field on *both* tasks, with the bound of [Equation 7](https://arxiv.org/html/2605.13959#S5.E7) tightened by exactly the same temporally grounded signal that RTC also exploits.

*(iii) Combining the two yields an additional improvement.* *RTC+WarmPrior* reaches $0.62$ on *Block Throwing* and $0.82$ on *Towel Folding*, exceeding both individual methods on both tasks; the increment is largest on *Towel Folding*, where the combination ($0.82$) sits well above either RTC alone ($0.67$) or WarmPrior alone ($0.68$). If the two methods relied on the same underlying effect, stacking them would saturate and produce no further gain. The fact that they compound is evidence that they reach their success rate via distinct mechanisms. We summarize the picture as follows. RTC enforces *explicit mode commitment* at inference time: by clamping the early portion of the flow to the chunk currently being executed, it guarantees zero discontinuity at the boundary, which is what stabilizes fast, committed motions across chunk transitions. WarmPrior reshapes the training-time coupling so that the learned velocity field is itself straighter, in the OT-aligned sense quantified by [Table 2](https://arxiv.org/html/2605.13959#S5.T2) and bounded in [Equation 7](https://arxiv.org/html/2605.13959#S5.E7). The two interventions address different failure modes of standard flow matching, namely curved learned flows (training side) and inter-chunk discontinuities (inference side), so combining them removes both at once.
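The comparisons in the three observations are simple differences of the reported success rates. The short sketch below tabulates them; the numbers are the ones quoted in the text above, while the dictionary layout and helper names are illustrative assumptions.

```python
# Success rates as quoted in the Findings text for Figure 12.
success = {
    "Block Throwing": {"Base": 0.32, "RTC": 0.57, "WarmPrior": 0.48, "RTC+WarmPrior": 0.62},
    "Towel Folding": {"Base": 0.50, "RTC": 0.67, "WarmPrior": 0.68, "RTC+WarmPrior": 0.82},
}


def gains_over_base(rates):
    """Absolute success-rate gain of each configuration over Base."""
    base = rates["Base"]
    return {name: round(sr - base, 2) for name, sr in rates.items() if name != "Base"}


def margin_over_best_single(rates):
    """How far the combined policy sits above the better single method."""
    return round(rates["RTC+WarmPrior"] - max(rates["RTC"], rates["WarmPrior"]), 2)


for task, rates in success.items():
    print(task, gains_over_base(rates), margin_over_best_single(rates))
```

The margin of the combined policy over the better single method is larger on *Towel Folding* than on *Block Throwing*, matching observation (iii).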

#### Practical implications\.

A practical consequence: RTC requires chunks long enough to leave a meaningful overlap window, and it commits the policy to the chunk currently being executed before re-planning, which is awkward on tasks that demand fast within-chunk reactivity ([Section 5.2](https://arxiv.org/html/2605.13959#S5.SS2)). WarmPrior's $\sigma$ knob ([Equation 7](https://arxiv.org/html/2605.13959#S5.E7), [Section 5.2](https://arxiv.org/html/2605.13959#S5.SS2)) instead supplies a continuous trade-off between *temporal commitment* and *multimodal expressiveness* that remains operative even at $H = 1$, where action chunking is effectively disabled. WarmPrior should therefore be read as a complement to RTC when chunking is available, and as a viable alternative when it is not.
