Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition

arXiv cs.LG 07/02/26, 04:00 AM Papers
reinforcement-learning flow-map generative-models few-step text-to-image consistency-models rl-post-training
Summary
Proposes Flow-Map GRPO, an online RL post-training framework for deterministic few-step flow-map generators, introducing Anchored Stochastic Flow Map Composition (ASFMC) to enable stochastic optimization without altering original model parameterization. Experiments on FLUX-based MeanFlow and sCM show improvement across reward-based, perceptual, and task-level metrics.
arXiv:2607.00535v1 Announce Type: new
Cached at: 07/02/26, 05:38 AM
# Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition
Source: [https://arxiv.org/html/2607.00535](https://arxiv.org/html/2607.00535)
###### Abstract

Few-step flow-map generators, such as consistency models and MeanFlow, accelerate sampling by directly learning long-range transport maps between noise and data. However, these models are typically deterministic, which makes them difficult to optimize with reinforcement learning (RL) post-training methods that require stochastic trajectories and well-defined likelihood ratios. Existing SDE-based stochasticization techniques are designed for velocity-based samplers with infinitesimal or finely discretized transitions, and therefore do not directly apply to long-range flow maps. In this work, we propose Flow-Map GRPO, an online RL post-training framework for deterministic few-step flow-map generators. The key component is Anchored Stochastic Flow Map Composition (ASFMC), a path-preserving stochasticization mechanism that introduces randomness through anchor-based conditional resampling while preserving the original marginal probability path of the deterministic flow map. We derive GRPO objectives for both single-time and two-time flow-map parameterizations. Experiments on few-step FLUX-based text-to-image generators, including MeanFlow and sCM, show that Flow-Map GRPO improves pretrained deterministic flow-map models across reward-based, perceptual, and task-level evaluation metrics. Our results demonstrate that deterministic few-step flow-map generators can be effectively aligned with RL post-training without modifying their original model parameterization or retraining them as native stochastic models.

Georgia Institute of Technology, Atlanta, GA, USA. Correspondence to: Zhiqi Li <zli3167@gatech.edu>.

## 1 Introduction

Diffusion models and continuous-time flow-based generative models have become a dominant paradigm for high-quality image and video generation Rombach et al. ([2022](https://arxiv.org/html/2607.00535#bib.bib45)); Ho et al. ([2022](https://arxiv.org/html/2607.00535#bib.bib32)); Dao et al. ([2023](https://arxiv.org/html/2607.00535#bib.bib81)). These methods construct a probability path between a simple prior distribution and the data distribution, and learn either a score field or a velocity field that defines a continuous-time generative process. In particular, flow-based approaches such as Flow Matching and Rectified Flow represent generation through a probability-flow ODE, which enables principled sampling by numerically integrating the learned velocity field. Despite its strong theoretical grounding, however, ODE-based sampling typically requires many discretization steps, leading to high computational cost at inference time.

This has motivated recent advances in few-step flow-map-based generative models, such as Consistency Models Song et al. ([2023](https://arxiv.org/html/2607.00535#bib.bib102)); Geng et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib86)) and MeanFlow Geng et al. ([2025a](https://arxiv.org/html/2607.00535#bib.bib76); [b](https://arxiv.org/html/2607.00535#bib.bib77)); Li et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib74)). Instead of learning only the instantaneous dynamics, these methods directly learn long-range mappings $\psi_{t\to r}$ that map samples between two time points. By amortizing numerical integration into learned long-range mappings, flow-map-based models can replace iterative ODE solvers with one-step or few-step generation.

However, deterministic few-step flow maps pose a difficulty for RL post-training, which aims to align a pretrained generator with task-level rewards while preserving its original flow-map parameterization and learned marginal probability path. Recent reinforcement learning (RL) post-training methods for generative models, such as DDPO Black et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib152)) and Flow-GRPO Liu et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib150)), formulate sampling as a Markov decision process and optimize the generative policy using task-level rewards. A key requirement of these methods is a well-defined stochastic transition kernel, which is needed both for trajectory-level exploration and for computing likelihood ratios in policy-gradient optimization. Diffusion models naturally provide such stochastic transitions through their denoising process, while velocity-based flow models can be equipped with stochastic transitions through path-preserving SDE reformulations Lipman et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib134)). In contrast, flow-map-based models directly parameterize deterministic long-range transports $\psi_{t\to r}$, which do not define stochastic trajectories or transition likelihoods. This creates a structural mismatch between deterministic few-step flow-map generators and likelihood-ratio-based RL post-training methods Boffi et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib100)).

![Refer to caption](https://arxiv.org/html/2607.00535v1/images/result/teaser.jpg)

Figure 1: Comparison of base generation and FlowMap-GRPO post-trained generation.

A natural attempt is to introduce stochasticity by using a path-preserving SDE reformulation of the underlying ODE dynamics. While this works for velocity-based samplers with infinitesimal or finely discretized transitions, it does not directly extend to long-range flow maps. A long-range SDE transition depends on the entire stochastic path between the source and target times and, in general, cannot be reduced to a Gaussian transition centered at the deterministic flow-map output (see [3.1](https://arxiv.org/html/2607.00535#S3.Thmtheorem1)). Exact SDE-based stochasticization would therefore require stochastic integration, undermining the computational advantage of few-step flow maps. This motivates a stochasticization mechanism that operates directly at the level of flow maps while preserving their probability path.

To address this limitation, we propose Flow-Map GRPO, an online RL post-training framework for deterministic few-step flow-map generators. Its key component is Anchored Stochastic Flow Map Composition (ASFMC), a path-preserving stochasticization mechanism that introduces randomness at the level of long-range flow maps. Instead of perturbing infinitesimal dynamics, ASFMC uses an auxiliary anchor variable: given a deterministic transition from $t$ to $r$, it first maps the sample to the target time, transports this deterministic endpoint to an anchor time $\tau$, and then resamples back to time $r$ through a conditional transition of the probability path. In this way, the anchor connects deterministic flow-map composition with stochastic conditional resampling, producing stochastic transitions while preserving the marginal distribution learned by the original model. With ASFMC, each flow-map step becomes a stochastic policy transition, allowing few-step flow-map sampling to be formulated as an MDP and optimized with GRPO-style likelihood ratios. The construction is post-hoc and does not modify the pretrained flow-map parameterization. We instantiate it for both major flow-map forms: one-time endpoint maps, such as sCM, and two-time maps, such as MeanFlow, where ASFMC can use local or endpoint anchors.

Our contributions are as follows:

- We identify a fundamental limitation of SDE-based stochasticization for deterministic flow maps, showing that it is not directly applicable to long-range flow-map transitions.
- We propose Anchored Stochastic Flow Map Composition (ASFMC), a principled stochastic flow-map construction that preserves the marginal probability path while enabling trajectory-level exploration.
- We introduce Flow-Map GRPO, a unified RL post-training framework for deterministic few-step flow-map generators, and derive consistent formulations for both single-time and two-time flow-map parameterizations.
- We empirically validate Flow-Map GRPO on few-step FLUX-based text-to-image generators, demonstrating improvements over pretrained MeanFlow and sCM checkpoints across reward-based, perceptual, and task-level evaluation metrics.

## 2 Background

### 2.1 Multi-Step Flow Models and Few-Step Flow Maps

Let $\mathcal{D}=\{x^{i}\in\mathcal{X}\}_{i=1}^{n}$ denote samples from an unknown data distribution $p_{1}=p_{\mathrm{data}}$ on $\mathcal{X}\subset\mathbb{R}^{d}$. A continuous-time generative model constructs a stochastic process $\{X_{t}\}_{t\in[0,1]}$ whose marginal distribution $p_{t}$ evolves from a simple base distribution $p_{0}$, typically a standard Gaussian, to the data distribution $p_{1}$, forming a probability path $\{p_{t}\}_{t=0}^{1}$. The probability path can be described by a deterministic probability-flow ODE. Given an initial sample $x_{0}\sim p_{0}$, its trajectory $\{x_{t}\}_{t\in[0,1]}$ evolves according to

$$\frac{dx_{t}}{dt}=u_{t}(x_{t}),\qquad x_{0}\sim p_{0},\tag{1}$$

where $u_{t}:\mathcal{X}\rightarrow\mathbb{R}^{d}$ is a time-dependent velocity field. The solution of [Equation 1](https://arxiv.org/html/2607.00535#S2.E1) defines a time-dependent flow $\psi_{t}:\mathcal{X}\rightarrow\mathcal{X}$ with $\psi_{0}(x)=x$, such that $x_{t}=\psi_{t}(x_{0})$. $\{X_{t}\}_{t\in[0,1]}$ is called a flow model Lipman et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib134)) if it is generated by [Equation 1](https://arxiv.org/html/2607.00535#S2.E1), namely $X_{t}=\psi_{t}(X_{0})$, $X_{0}\sim p_{0}$, and the corresponding marginal distribution can be induced by the pushforward $p_{t}=[\psi_{t}]_{\sharp}p_{0}$, satisfying the continuity equation

$$\partial_{t}p_{t}(x)=-\nabla\cdot\big(p_{t}(x)u_{t}(x)\big).\tag{2}$$

Flow models cover major continuous-time generative models, including Flow Matching Lipman et al. ([2022](https://arxiv.org/html/2607.00535#bib.bib78)), Rectified Flow Liu et al. ([2023](https://arxiv.org/html/2607.00535#bib.bib98)), and diffusion models such as DDIM Song et al. ([2020a](https://arxiv.org/html/2607.00535#bib.bib79)).

Flow models learn $u_{t}^{\theta}(x)$ (or an equivalent parameterization, such as epsilon prediction) from $\mathcal{D}$ and generate samples by numerically solving [Equation 1](https://arxiv.org/html/2607.00535#S2.E1). To train $u_{t}^{\theta}(x)$, flow models construct conditional paths $X_{t}=a_{t}X_{1}+b_{t}X_{0}$, where $a_{t}$ and $b_{t}$ are scheduler functions satisfying $a_{1}=b_{0}=1$ and $a_{0}=b_{1}=0$. The corresponding conditional velocity target is $u_{t}(X_{t}|X_{1},X_{0})=\dot{a}_{t}X_{1}+\dot{b}_{t}X_{0}$. After marginalizing over $X_{0}$, the conditional path and velocity become $p_{t}(\cdot|X_{1})$ and $u_{t}(\cdot|X_{1})$, whose marginalization again recovers the desired probability path and marginal velocity in [Equation 1](https://arxiv.org/html/2607.00535#S2.E1): $p_{t}(x)=\int p_{t}(x|x_{1})p_{1}(x_{1})\,dx_{1}$ and $u_{t}(x)=\mathbb{E}[u_{t}(x|X_{1})\,|\,X_{t}=x]$. For the affine conditional path above, the conditional velocity can be written as

$$u_{t}(x|x_{1})=\frac{\dot{b}_{t}}{b_{t}}x+\Big(\dot{a}_{t}-\frac{a_{t}\dot{b}_{t}}{b_{t}}\Big)x_{1},\tag{3}$$

and the model is then trained with the surrogate objective

$$\mathcal{L}_{c}(\theta)=\mathbb{E}_{t,\,x_{1}\sim p_{1},\,x\sim p_{t}(\cdot|x_{1})}\big[\|u_{t}^{\theta}(x)-u_{t}(x|x_{1})\|^{2}\big].\tag{4}$$

The Flow Matching and DDIM paths can be recovered by choosing $(a_{t},b_{t})=(t,1-t)$ and $(a_{t},b_{t})=(\sqrt{1-(1-t)^{2}},1-t)$, respectively. For conditional generation, the probability path and velocity field can be directly conditioned on $c$ with $p_{t}(x|x_{1},c)$ and $u_{t}(x|x_{1},c)$.
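As a sanity check on Equations 3 and 4, the following minimal NumPy sketch evaluates the conditional velocity target for the Flow Matching schedule $(a_{t},b_{t})=(t,1-t)$ and confirms that it agrees with the direct target $\dot{a}_{t}X_{1}+\dot{b}_{t}X_{0}$. The function names (`schedule`, `conditional_velocity`, `cfm_loss`) and the random stand-in data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def schedule(t):
    """Flow Matching schedule (a_t, b_t) = (t, 1 - t) and its time derivatives."""
    return t, 1.0 - t, 1.0, -1.0  # a_t, b_t, da_t/dt, db_t/dt

def conditional_velocity(x, x1, t):
    """Equation 3: u_t(x | x1) for the affine path X_t = a_t X_1 + b_t X_0."""
    a, b, da, db = schedule(t)
    return (db / b) * x + (da - a * db / b) * x1

def cfm_loss(u_theta, x1, x0, t):
    """Monte Carlo estimate of Equation 4 for one batch at a single time t."""
    a, b, _, _ = schedule(t)
    x_t = a * x1 + b * x0
    target = conditional_velocity(x_t, x1, t)
    return np.mean(np.sum((u_theta(x_t, t) - target) ** 2, axis=-1))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 2)) + 3.0   # stand-in "data" batch (hypothetical)
x0 = rng.normal(size=(8, 2))         # base noise
t = 0.3
a, b, da, db = schedule(t)
x_t = a * x1 + b * x0

direct_target = da * x1 + db * x0               # conditional target given (X_1, X_0)
eq3_target = conditional_velocity(x_t, x1, t)   # same target written via Equation 3
zero_model_loss = cfm_loss(lambda x, s: np.zeros_like(x), x1, x0, t)
```

Here `u_theta` is any callable mapping `(x, t)` to a velocity; a trained network would take its place.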

The velocity field $u^{\theta}_{t}$ only describes the instantaneous evolution of the generative process. After discretization, it gives a short-range transition from $t$ to $t+\Delta t$, so generating a sample requires repeatedly solving [Equation 1](https://arxiv.org/html/2607.00535#S2.E1) over many steps. To enable one-step or few-step generation, flow-map-based methods, including Consistency Models Geng et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib86)) and MeanFlow Geng et al. ([2025a](https://arxiv.org/html/2607.00535#bib.bib76)), directly learn a flow map $\psi^{\theta}_{t\to r}$, where the exact flow map is defined as $\psi_{t\to r}=\psi_{r}\circ\psi^{-1}_{t}$. This enables one-step generation as $x_{1}=\psi^{\theta}_{0\to 1}(x_{0})$ and few-step generation, e.g., two steps as $x_{0.5}=\psi^{\theta}_{0\to 0.5}(x_{0})$ and $x_{1}=\psi^{\theta}_{0.5\to 1}(x_{0.5})$. Although flow-map-based methods have clear advantages in generation speed, they face a key training challenge: direct supervision for $\psi_{t\to r}$ is difficult to obtain from the dataset Li et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib74)). In [Appendix A](https://arxiv.org/html/2607.00535#A1), we summarize common strategies for learning flow maps and discuss representative flow-map-based methods, including Consistency Models and MeanFlow.
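To make the composition property concrete, here is a minimal sketch on a toy problem where the exact flow map is known in closed form. It assumes a hypothetical one-dimensional Gaussian data distribution $p_{1}=\mathcal{N}(M,S^{2})$ and the Flow Matching path $X_{t}=tX_{1}+(1-t)X_{0}$; in that case every marginal $p_{t}$ is Gaussian and the ODE flow between two times is the increasing affine map between the two Gaussians. None of the names below come from the paper.

```python
import numpy as np

M, S = 2.0, 0.5  # hypothetical 1-D data distribution p_1 = N(M, S^2)

def marginal(t):
    """Mean and standard deviation of p_t for X_t = t X_1 + (1 - t) X_0, X_0 ~ N(0, 1)."""
    return t * M, np.sqrt((t * S) ** 2 + (1.0 - t) ** 2)

def flow_map(x, t, r):
    """Exact psi_{t->r} for this toy path: the increasing affine map from p_t to p_r."""
    mu_t, s_t = marginal(t)
    mu_r, s_r = marginal(r)
    return mu_r + (s_r / s_t) * (x - mu_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=100_000)

one_step = flow_map(x0, 0.0, 1.0)                       # x_1 = psi_{0->1}(x_0)
two_step = flow_map(flow_map(x0, 0.0, 0.5), 0.5, 1.0)   # psi_{0.5->1}(psi_{0->0.5}(x_0))
```

For an exact flow map the one-step and two-step outputs coincide sample by sample; a learned $\psi^{\theta}_{t\to r}$ would satisfy this only approximately.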

### 2.2 Reinforcement Learning for Flow Models

Recent works Liu et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib150)); Li et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib151)); Xue et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib153)); Black et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib152)) aim to improve flow and diffusion models with task-specific rewards, such as human preference, aesthetic quality, or downstream evaluation metrics. These rewards are often non-differentiable; therefore, these methods formulate generation as a reinforcement learning problem by treating the sampling trajectory as a multi-step MDP. For a conditional generative model with condition $c$, the state is $s_{t}=(c,t,x_{t})$, the action is the next sample along the trajectory, $a_{t}=x_{t+\Delta t}$, the policy is the model transition $\pi_{\theta}(a_{t}|s_{t})=p_{\theta}(x_{t+\Delta t}|x_{t},c)$, and the next-state transition is $P(s_{t+\Delta t}|s_{t},a_{t})=\delta_{(c,t+\Delta t,a_{t})}(s_{t+\Delta t})$. The initial state is drawn from $\rho_{0}(s_{0})=p(c)p_{0}(x_{0})$, and the reward is assigned to the final generated sample, $R(s_{t},a_{t})=r(x_{1}|c)$ when $t=1$. The model can then be optimized with policy-gradient objectives based on the transition likelihood $p_{\theta}(x_{t+\Delta t}|x_{t},c)$.

The above formulation requires the policy $\pi_{\theta}(a_{t}|s_{t})=p_{\theta}(x_{t+\Delta t}|x_{t},c)$ to be stochastic, so that the RL procedure can explore different trajectories and so that the policy likelihood ratio $\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{ref}}}(a_{t}|s_{t})}$, which is sometimes used for regularization, is well-defined rather than singular. DDPO Black et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib152)) instantiates this framework for diffusion models. Since diffusion samplers, such as DDPM variants Ho et al. ([2020](https://arxiv.org/html/2607.00535#bib.bib82)), naturally define stochastic denoising transitions, their transition likelihoods can be directly used for policy optimization. Flow-GRPO extends the same idea to flow models. Because they are defined by the deterministic ODE in [Equation 1](https://arxiv.org/html/2607.00535#S2.E1), standard flow models do not directly provide stochastic actions or transition likelihoods for reinforcement learning. To introduce stochasticity, Flow-GRPO rewrites the deterministic flow dynamics as the following equivalent SDE Lipman et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib134)):

$$dX_{t}=\Big[u_{t}(X_{t})+\frac{\sigma_{t}^{2}}{2}\nabla_{x}\log p_{t}(X_{t})\Big]dt+\sigma_{t}\,dW_{t},\tag{5}$$

where $\sigma_{t}$ controls the stochasticity and $W_{t}$ denotes a standard Wiener process. The score correction cancels the diffusion effect at the marginal level, so the marginal path $\{\tilde{p}_{t}\}_{t=0}^{1}$ induced by [Equation 5](https://arxiv.org/html/2607.00535#S2.E5) matches the probability path $\{p_{t}\}_{t=0}^{1}$ of the deterministic flow ODE in [Equation 1](https://arxiv.org/html/2607.00535#S2.E1), while retaining stochastic trajectories for GRPO exploration.
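The marginal-preservation property of Equation 5 can be checked numerically on a toy problem. The sketch below assumes a hypothetical one-dimensional Gaussian target $p_{1}=\mathcal{N}(M,S^{2})$ with the Flow Matching path, for which the marginal velocity and score are available in closed form, and a constant $\sigma_{t}$ chosen purely for illustration. It integrates both the ODE ($\sigma=0$) and the SDE with Euler-Maruyama steps.

```python
import numpy as np

M, S = 2.0, 0.5   # hypothetical 1-D data distribution p_1 = N(M, S^2)
SIGMA = 0.5       # constant noise level sigma_t (illustrative choice)

def marginal(t):
    """Mean and variance of p_t for X_t = t X_1 + (1 - t) X_0, X_0 ~ N(0, 1)."""
    return t * M, (t * S) ** 2 + (1.0 - t) ** 2

def velocity(x, t):
    """Closed-form marginal velocity u_t(x) of the Gaussian toy path."""
    mu, var = marginal(t)
    half_dvar = t * S ** 2 - (1.0 - t)   # (1/2) d var / dt
    return M + (half_dvar / var) * (x - mu)

def score(x, t):
    """Closed-form score grad_x log p_t(x) of the Gaussian marginal."""
    mu, var = marginal(t)
    return -(x - mu) / var

def sample(n, steps, sigma, rng):
    """Euler-Maruyama integration of Equation 5 from t = 0 to t = 1."""
    x = rng.normal(size=n)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        drift = velocity(x, t) + 0.5 * sigma ** 2 * score(x, t)
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
    return x

rng = np.random.default_rng(0)
x_ode = sample(50_000, 200, 0.0, rng)     # deterministic probability-flow ODE
x_sde = sample(50_000, 200, SIGMA, rng)   # path-preserving SDE
```

Up to discretization and Monte Carlo error, both samplers end with mean close to `M` and standard deviation close to `S`, even though individual SDE trajectories are random.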

## 3 Method

![Refer to caption](https://arxiv.org/html/2607.00535v1/x1.png) (a) Infinitesimal transition
![Refer to caption](https://arxiv.org/html/2607.00535v1/x2.png) (b) Long-range transition

Figure 2: Comparison of SDE-induced stochasticization for infinitesimal and long-range transitions. For an infinitesimal transition $\psi_{t\to t+\Delta t}$, the SDE corresponding to the deterministic ODE yields a simple Gaussian transition, enabling convenient stochasticization $\tilde{\psi}^{SDE}_{t\to t+\Delta t}$ of instantaneous velocity-based samplers. In contrast, for a long-range transition $\psi_{t\to r}$, the accumulated stochastic dynamics generally cannot be reduced to a simple Gaussian form, preventing its direct use as a stochasticization mechanism $\tilde{\psi}^{SDE}_{t\to r}$ for flow maps.

Existing RL post-training methods for generative models mainly target multi-step samplers, such as diffusion models Black et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib152)) and flow matching models Liu et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib150)), where the generation process naturally forms a stochastic sequential decision process. In this work, we extend RL-based alignment to flow-map-based few-step generators Geng et al. ([2025a](https://arxiv.org/html/2607.00535#bib.bib76)); Frans et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib97)); Geng et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib86)). A flow map defines a deterministic transport and therefore cannot be directly optimized with policy-gradient methods that require stochastic trajectories. The key problem is how to introduce stochasticity into flow-map sampling without changing its underlying probability path, which is nontrivial because flow-map dynamics are not infinitesimal ([subsection 3.1](https://arxiv.org/html/2607.00535#S3.SS1)).
To address this problem, we propose Anchored Stochastic Flow Map Composition (ASFMC) ([subsection 3.2](https://arxiv.org/html/2607.00535#S3.SS2)) and discuss how it can be instantiated for different flow-map parameterizations, including both two-parameter and one-parameter forms ([subsection 3.3](https://arxiv.org/html/2607.00535#S3.SS3)). Importantly, our method does not change the definition of the deterministic flow map, making it directly applicable to post-training existing flow-map models. We mainly instantiate our framework with GRPO Shao et al. ([2024](https://arxiv.org/html/2607.00535#bib.bib154)), while noting that the same stochasticization principle can be combined with other RL algorithms and optimization techniques ([section 4](https://arxiv.org/html/2607.00535#S4)).

### 3.1 Challenge for Flow Maps

For RL post-training, the randomized flow map $\tilde{\psi}_{t\to r}$ should satisfy two requirements. First, it should induce stochastic trajectories, allowing policy optimization to explore different generation paths. Second, it should preserve the original probability path of the pretrained flow map, so that the introduced stochasticity does not change the learned marginal distributions Liu et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib150)). We formalize this requirement as follows.

###### Proposition 3.1 (Path-preserving stochasticization).

Let $\psi_{t\to r}$ denote a deterministic flow map that induces the probability path $\{p_{t}=[\psi_{0\to t}]_{\sharp}p_{0}\}_{t=0}^{1}$. A valid randomized flow map $\tilde{\psi}_{t\to r}$ for RL post-training should induce stochastic sample trajectories while preserving the same marginal path, i.e., its induced probability path $\{\tilde{p}_{t}=[\tilde{\psi}_{0\to t}]_{\sharp}p_{0}\}_{t=0}^{1}$ satisfies $\tilde{p}_{t}=p_{t}$ for all $t\in[0,1]$.

A natural idea is to follow Flow-GRPO and inject stochasticity through an equivalent SDE whose marginal distributions match those of the deterministic ODE. For an infinitesimal step, the SDE induces a simple Gaussian transition $x_{t+\Delta t}=x_{t}+\big[u_{t}(x_{t})+\frac{\sigma_{t}^{2}}{2}\nabla_{x}\log p_{t}(x_{t})\big]\Delta t+\sigma_{t}\sqrt{\Delta t}\,\epsilon_{t}$ with $\epsilon_{t}\sim\mathcal{N}(0,I)$, which allows stochasticity to be conveniently injected into velocity-based samplers, as adopted by Flow-GRPO. However, flow maps are long-range transport operators between arbitrary time pairs, and their transitions are not infinitesimal. The long-range transition induced by the SDE generally cannot be reduced to a simple additive-noise form, i.e., a deterministic part plus an input-independent Gaussian perturbation, as shown in [Figure 2](https://arxiv.org/html/2607.00535#S3.F2).
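The infinitesimal-step transition above is Gaussian with mean $x_{t}+[u_{t}+\frac{\sigma_{t}^{2}}{2}\nabla_{x}\log p_{t}]\Delta t$ and covariance $\sigma_{t}^{2}\Delta t\,I$, which is what makes its likelihood ratio easy to evaluate. The following minimal sketch computes that transition and a log-likelihood ratio between two policies; the velocity and score arrays are random placeholders standing in for model outputs, and all function names are assumptions made for illustration.

```python
import numpy as np

def em_transition(x_t, u_t, score_t, sigma_t, dt):
    """Mean and standard deviation of the one-step Gaussian transition."""
    mean = x_t + (u_t + 0.5 * sigma_t ** 2 * score_t) * dt
    std = sigma_t * np.sqrt(dt)
    return mean, std

def gaussian_log_prob(x_next, mean, std):
    """Log-density of N(mean, std^2 I) at x_next, reduced over the last axis."""
    d = x_next.shape[-1]
    sq_dist = np.sum((x_next - mean) ** 2, axis=-1)
    return -0.5 * sq_dist / std ** 2 - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
x_t = rng.normal(size=(4, 2))
score_t = -x_t                      # placeholder score (standard normal marginal)
u_ref = rng.normal(size=(4, 2))     # placeholder reference-policy velocity
u_new = u_ref + 0.1                 # placeholder updated-policy velocity
sigma_t, dt = 0.7, 0.05

mean_ref, std = em_transition(x_t, u_ref, score_t, sigma_t, dt)
mean_new, _ = em_transition(x_t, u_new, score_t, sigma_t, dt)

# Action sampled from the reference policy, then scored under both policies.
x_next = mean_ref + std * rng.normal(size=x_t.shape)
log_ratio = gaussian_log_prob(x_next, mean_new, std) - gaussian_log_prob(x_next, mean_ref, std)
```

Because both policies share the same `std`, the normalizing constants cancel and the ratio depends only on the two means. No comparably simple closed form is available for a long-range step, which is the obstacle discussed next.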

###### Theorem 3.2 (Long-range SDE transitions require stochastic integration).

Let $X_{s}$ solve the path-preserving SDE [Equation 5](https://arxiv.org/html/2607.00535#S2.E5) and let $\psi_{t\to r}$ be the deterministic flow map induced by the ODE [Equation 1](https://arxiv.org/html/2607.00535#S2.E1). For a non-infinitesimal interval $t<r$, the exact SDE transition is generally not representable as a simple additive Gaussian perturbation of a deterministic function of the flow maps:

$$X_{r}=G_{t,r}(X_{t};\psi,d\psi,\ldots)+A_{t,r}\epsilon,\qquad\epsilon\sim\mathcal{N}(0,I),\tag{6}$$

where $A_{t,r}$ is independent of the input sample and $G_{t,r}(X_{t};\psi,d\psi,\ldots)$ denotes a deterministic functional of the flow-map family $\psi$ and its derivatives. Instead, the exact transition must be expressed by the stochastic integral $X_{r}=X_{t}+\int_{t}^{r}b_{s}(X_{s})\,ds+\int_{t}^{r}\sigma_{s}\,dW_{s}$, where $b_{s}(x)=u_{s}(x)+\frac{\sigma_{s}^{2}}{2}\nabla_{x}\log p_{s}(x)$, except for special linear-Gaussian dynamics. See [subsection B.1](https://arxiv.org/html/2607.00535#A2.SS1) for proof.

As a result, the SDE-based stochasticization used in Flow-GRPO cannot be directly transferred to flow maps, which calls for a different stochasticization mechanism tailored to long-range flow-map transitions.

### 3.2 Anchored Stochastic Flow Map Composition

![Refer to caption](https://arxiv.org/html/2607.00535v1/x3.png) (a) ASFMC principle
![Refer to caption](https://arxiv.org/html/2607.00535v1/x4.png) (b) Endpoint anchor
![Refer to caption](https://arxiv.org/html/2607.00535v1/x5.png) (c) Local anchor

Figure 3: Illustration of Anchored Stochastic Flow Map Composition (ASFMC). Left: ASFMC decomposes a long-range transition $\psi_{t\to r}$ into three flow-map segments, $t\to r$, $r\to\tau$, and $\tau\to r$, with anchor time $\tau$ and anchor state $x_{\mathrm{anc}}$. The reverse segment $r\to\tau$ samples a stochastic intermediate state conditioned on $X_{r}$, thereby injecting randomness into the original transition $t\to r$ while preserving the marginal path. Middle and right: two valid choices of the anchor time, $\tau=1$ and $\tau=r+\Delta r$ with small $\Delta r$, respectively. We discuss another natural but invalid choice of $\tau$ in [subsubsection 3.2.3](https://arxiv.org/html/2607.00535#S3.SS2.SSS3).

To address this challenge, we propose *Anchored Stochastic Flow Map Composition* (ASFMC), which injects randomness into flow-map sampling through an auxiliary anchor while preserving the original marginal path. The key intuition is that although the forward flow-map transition $\psi_{t\to r}$ is deterministic, stochasticity can be introduced through a conditional transition, which is naturally stochastic, such as the noising or posterior transition in diffusion models. Specifically, ASFMC first moves the deterministic endpoint at time $r$ to an anchor point $x_{\mathrm{anc}}$ at the anchor time $\tau$, and then samples back to time $r$ through the conditional distribution $p_{r|\tau}(\cdot|x_{\mathrm{anc}})$. The anchor $(\tau,x_{\mathrm{anc}})$ is therefore the shared interface between the deterministic compensation segment and the stochastic resampling segment, allowing ASFMC to combine flow-map composition with stochastic resampling while keeping the overall transition from $t$ to $r$ unchanged.

We now formalize the construction; see [3(a)](https://arxiv.org/html/2607.00535#S3.F3.sf1) for an illustration. Given an input state $x_{t}$ and a target time $r$, the original deterministic flow map produces $x_{r}=\psi_{t\to r}(x_{t})$. ASFMC then chooses an anchor time $\tau$ and defines the anchor state by transporting this deterministic endpoint to $\tau$: $x_{\mathrm{anc}}=\psi_{r\to\tau}(x_{r})=\psi_{r\to\tau}\circ\psi_{t\to r}(x_{t})$. The randomized output at time $r$ is obtained by sampling from the conditional distribution back to the target time, $\tilde{x}_{r}\sim p_{r|\tau}(\cdot|x_{\mathrm{anc}})$. Namely, ASFMC defines the stochastic flow map

$$\tilde{\psi}_{t\to r}(x_{t})=\tilde{x}_{r},\qquad\tilde{x}_{r}\sim p_{r|\tau}\big(\cdot\,|\,\psi_{r\to\tau}\circ\psi_{t\to r}(x_{t})\big).\tag{7}$$

In this construction, the segment $r\to\tau$ forms an anchor state with marginal $p_{\tau}$, and the conditional transition $\tau\to r$ injects randomness while returning to the target time. Thus, ASFMC preserves the original transition times and replaces the deterministic endpoint with a stochastic sample from the correct marginal. Moreover, this construction preserves the marginal distribution, since marginalizing the conditional transition $p_{r|\tau}$ over the anchor marginal $p_{\tau}$ recovers $p_{r}$.

###### Theorem 3.3 (Path preservation of ASFMC).

Let $X_{t}\sim p_{t}$ and define $X_{r}=\psi_{t\to r}(X_{t})$, $X_{\mathrm{anc}}=\psi_{r\to\tau}(X_{r})$. If the randomized output $\tilde{X}_{r}$ is sampled from the reverse conditional distribution

$$\tilde{X}_{r}\sim p_{r|\tau}(\cdot|X_{\mathrm{anc}}),\tag{8}$$

then $\tilde{X}_{r}\sim p_{r}$. Therefore, the stochastic flow map $\tilde{\psi}_{t\to r}$ induced by ASFMC preserves the marginal distribution of $\psi_{t\to r}$.
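A toy example may help illustrate the theorem. The sketch below assumes a hypothetical one-dimensional Gaussian target $p_{1}=\mathcal{N}(M,S^{2})$ with the Flow Matching path $X_{t}=tX_{1}+(1-t)X_{0}$, so the exact flow map is affine, and it uses the endpoint anchor $\tau=1$. For that anchor, the conditional $p_{r|1}(\cdot|x_{1})$ is taken to be the conditional path $\mathcal{N}(r\,x_{1},(1-r)^{2})$; this identification and all function names are assumptions of the sketch rather than the paper's implementation.

```python
import numpy as np

M, S = 2.0, 0.5  # hypothetical 1-D data distribution p_1 = N(M, S^2)

def marginal(t):
    """Mean and standard deviation of p_t for X_t = t X_1 + (1 - t) X_0."""
    return t * M, np.sqrt((t * S) ** 2 + (1.0 - t) ** 2)

def flow_map(x, t, r):
    """Exact deterministic flow map psi_{t->r} of the Gaussian toy path."""
    mu_t, s_t = marginal(t)
    mu_r, s_r = marginal(r)
    return mu_r + (s_r / s_t) * (x - mu_t)

def asfmc_step(x_t, t, r, rng):
    """Endpoint-anchor ASFMC (tau = 1), following Equation 7."""
    x_r = flow_map(x_t, t, r)          # deterministic endpoint at time r
    x_anc = flow_map(x_r, r, 1.0)      # anchor state at tau = 1
    xi = rng.normal(size=x_anc.shape)
    return r * x_anc + (1.0 - r) * xi  # sample from N(r * x_anc, (1 - r)^2)

rng = np.random.default_rng(0)
t, r, n = 0.25, 0.75, 200_000
mu_t, s_t = marginal(t)
mu_r, s_r = marginal(r)

x_t = mu_t + s_t * rng.normal(size=n)   # X_t ~ p_t
x_det = flow_map(x_t, t, r)             # deterministic flow-map output
x_tilde = asfmc_step(x_t, t, r, rng)    # stochastic ASFMC output
```

Both `x_det` and `x_tilde` follow $p_{r}$ up to Monte Carlo error, but `x_tilde` differs from `x_det` sample by sample, which is the trajectory-level randomness that policy optimization needs.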

We next discuss the choice of the anchor time $\tau$, which leads to different stochastic flow maps $\tilde{\psi}_{t\to r}$. We consider three choices that are useful in different settings: a local anchor $\tau=r+\Delta r$ with small $\Delta r$, an endpoint anchor $\tau=1$, and an intermediate anchor $r<\tau<1$.

![Refer to caption](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_013_grid.jpg)

Figure 4: MeanFlow post-training with PickScore reward. Qualitative comparison between the base MeanFlow generator and the FlowMap-GRPO post-trained MeanFlow model on the prompt “eating pizza at a romantic date.” The model is post-trained using PickScore reward with $K=4$ stochastic flow-map steps during training, and evaluated under multiple inference step budgets to examine cross-step generalization.

#### 3.2.1 Local Anchor

The local anchor chooses an anchor time close to the target time,τ=r\+Δr\\tau=r\+\\Delta r, whereΔr\>0\\Delta r\>0is small\. For the short anchor interval\[r,r\+Δr\]\[r,r\+\\Delta r\], the conditional transitionpr\|r\+Δrp\_\{r\|r\+\\Delta r\}can be approximated by the closed\-form local Gaussian transition induced by the SDE corresponding to the reverse direction of the ODE in[Equation 1](https://arxiv.org/html/2607.00535#S2.E1)fromt=1t=1tot=0t=0, while preserving the marginal path of the reverse\-time dynamics:dXt=\[ut\(Xt\)−σt22∇xlog⁡pt\(Xt\)\]dt\+σtdWtdX\_\{t\}=\[u\_\{t\}\(X\_\{t\}\)\-\\frac\{\\sigma\_\{t\}^\{2\}\}\{2\}\\nabla\_\{x\}\\log p\_\{t\}\(X\_\{t\}\)\]dt\+\\sigma\_\{t\}dW\_\{t\}\.We use the diffusion scaleσr=λ2br\(a˙rbr−arb˙r\)ar\\sigma\_\{r\}=\\lambda\\sqrt\{\\frac\{2b\_\{r\}\(\\dot\{a\}\_\{r\}b\_\{r\}\-a\_\{r\}\\dot\{b\}\_\{r\}\)\}\{a\_\{r\}\}\}followingLiuet al\.\([2026](https://arxiv.org/html/2607.00535#bib.bib150)\)\. For a small stepΔr\>0\\Delta r\>0, the local transition fromτ\\tautorrtakes the form by discretization

$$\begin{aligned}
\tilde{X}_{r}&=X_{\tau}-\Delta r\Big[u_{\tau}(X_{\tau})-\frac{\sigma_{\tau}^{2}}{2}\nabla_{x}\log p_{\tau}(X_{\tau})\Big]+\sigma_{\tau}\sqrt{\Delta r}\,\xi+O(\Delta r^{3/2})\\
&=\Big(1-\frac{\Delta r\,\sigma_{\tau}^{2}}{2}\frac{\dot{a}_{\tau}}{b_{\tau}(\dot{a}_{\tau}b_{\tau}-a_{\tau}\dot{b}_{\tau})}\Big)X_{\tau}-\Delta r\Big(1-\frac{\sigma_{\tau}^{2}}{2}\frac{a_{\tau}}{b_{\tau}(\dot{a}_{\tau}b_{\tau}-a_{\tau}\dot{b}_{\tau})}\Big)u_{\tau}(X_{\tau})+\sigma_{\tau}\sqrt{\Delta r}\,\xi+O(\Delta r^{3/2})\\
&=\Big(1-\frac{\lambda^{2}\Delta r\,\dot{a}_{\tau}}{a_{\tau}}\Big)X_{\tau}-\Delta r(1-\lambda^{2})u_{\tau}(X_{\tau})+\sigma_{\tau}\sqrt{\Delta r}\,\xi+O(\Delta r^{3/2}),
\end{aligned}\tag{9}$$

where $\xi\sim\mathcal{N}(0,I)$, and $\nabla_{x}\log p_{r}(X_{r})=\frac{a_{r}u_{r}(X_{r})-\dot{a}_{r}X_{r}}{b_{r}(\dot{a}_{r}b_{r}-a_{r}\dot{b}_{r})}$ is the score function of the affine probability path. The last line follows by inserting the diffusion scale, which gives $\frac{\sigma_{\tau}^{2}}{2}\frac{a_{\tau}}{b_{\tau}(\dot{a}_{\tau}b_{\tau}-a_{\tau}\dot{b}_{\tau})}=\lambda^{2}$. For the short forward deterministic segment $r\to\tau$, we have $X_{\tau}=\psi_{r\to r+\Delta r}(X_{r})=X_{r}+\Delta r\,u_{r}(X_{r})+O(\Delta r^{2})$ by discretizing [Equation 1](https://arxiv.org/html/2607.00535#S2.E1).

We substitute the above expressions for the deterministic segment $r\to\tau$ and the stochastic segment $\tau\to r$ into [Equation 7](https://arxiv.org/html/2607.00535#S3.E7) and use the local approximations $u_{\tau}(X_{\tau})\approx u_{r}(X_{r})$ and $\sigma_{\tau}\approx\sigma_{r}$. The $+\Delta r\,u_{r}(X_{r})$ term of the forward step cancels against the $-\Delta r\,u_{\tau}(X_{\tau})$ part of the drift, so only the $\lambda^{2}$-scaled terms remain at first order, and we obtain the local-anchor stochastic flow map

$$\tilde{\psi}_{t\to r}^{\mathrm{loc}}(X_{t})=X_{r}-\Delta r\,\lambda^{2}\Big[\frac{\dot{a}_{\tau}}{a_{\tau}}X_{r}-u_{r}(X_{r})\Big]+\sigma_{\tau}\sqrt{\Delta r}\,\xi+O(\Delta r^{3/2}),\qquad\xi\sim\mathcal{N}(0,I).\tag{10}$$

This gives a simple closed-form stochasticization $\tilde{\psi}_{t\to r}$ of the deterministic flow-map transition $\psi_{t\to r}$, and the randomness is controlled by $\sigma_{\tau}\sqrt{\Delta r}$. The following theorem establishes the $O((\Delta r)^{3/2})$ error bound for the approximation used in the above local-anchor derivation.

###### Theorem 3.4 (Error of the local-anchor approximation).

Let $\tilde{X}_{r}^{\star}$ denote the exact sample obtained by running the reverse-time SDE transition $p_{r|\tau}(\cdot|X_{\tau})$ from $\tau=r+\Delta r$ to $r$, where $X_{r}=\psi_{t\to r}(X_{t})$ and $X_{\tau}=\psi_{r\to r+\Delta r}(X_{r})$. Let $\tilde{X}_{r}^{\mathrm{loc}}$ denote the local-anchor approximation

$$\tilde{X}_{r}^{\mathrm{loc}}=X_{r}-\Delta r\,\lambda^{2}\Big[\frac{\dot{a}_{r}}{a_{r}}X_{r}-u_{r}(X_{r})\Big]+\sigma_{r}\sqrt{\Delta r}\,\xi,\qquad\xi\sim\mathcal{N}(0,I),$$

where $\sigma_{r}=\lambda\sqrt{\frac{2b_{r}(\dot{a}_{r}b_{r}-a_{r}\dot{b}_{r})}{a_{r}}}$. Assume that $u_{s}(x)$, $a_{s}$, $b_{s}$, and $\sigma_{s}$ are smooth with bounded derivatives in a neighborhood of the trajectory. Then the local-anchor approximation satisfies the strong error bound

$$\|\tilde{X}_{r}^{\star}-\tilde{X}_{r}^{\mathrm{loc}}\|_{L^{2}}=O((\Delta r)^{3/2}).\tag{11}$$

See [subsection B.2](https://arxiv.org/html/2607.00535#A2.SS2) for the proof.
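The local-anchor update of Equation 10 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the linear affine path $a_{r}=r$, $b_{r}=1-r$ (so $\sigma_{r}=\lambda\sqrt{2(1-r)/r}$), and the names `sigma_linear` and `local_anchor_step` are hypothetical. The velocity `u_r` is passed in as an array, standing in for a model's instantaneous-velocity prediction.

```python
import numpy as np

def sigma_linear(r, lam):
    """Diffusion scale lam * sqrt(2 b (a' b - a b') / a) for the assumed path a_r = r, b_r = 1 - r."""
    a, b, da, db = r, 1.0 - r, 1.0, -1.0
    return lam * np.sqrt(2.0 * b * (da * b - a * db) / a)

def local_anchor_step(x_r, u_r, r, dr, lam, xi):
    """One local-anchor update (Equation 10) with coefficients evaluated at tau = r + dr.

    x_r : deterministic flow-map output at time r
    u_r : instantaneous velocity at (x_r, r), supplied by the caller
    xi  : standard Gaussian noise with the same shape as x_r
    """
    tau = r + dr
    a_tau, da_tau = tau, 1.0                    # linear path: a_tau = tau, derivative 1
    drift = (da_tau / a_tau) * x_r - u_r        # bracketed term in Equation 10
    noise = sigma_linear(tau, lam) * np.sqrt(dr) * xi
    return x_r - dr * lam**2 * drift + noise

rng = np.random.default_rng(0)
x_r = np.array([1.0, -0.5])
u_r = np.array([0.5, 0.2])
sample = local_anchor_step(x_r, u_r, r=0.5, dr=0.1, lam=1.0, xi=rng.standard_normal(2))
```

Setting `lam=0` removes both the drift correction and the noise, so the step returns the deterministic output `x_r` unchanged, which matches the role of $\lambda$ as the randomness scale.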

#### 3.2.2 Endpoint Anchor

The endpoint anchor chooses $\tau=1$. In this case, the anchor state is obtained by transporting the deterministic endpoint $X_{r}=\psi_{t\to r}(X_{t})$ to the data endpoint: $X_{1}=\psi_{r\to 1}(X_{r})$. For the affine probability path, we have $X_{r}=a_{r}X_{1}+b_{r}X_{0}$, where $X_{0}\sim\mathcal{N}(0,I)$ is independent of $X_{1}$. Therefore, the conditional distribution $p_{r|1}$ has the closed-form Gaussian expression

$$p_{r|1}(\cdot|X_{1})=\mathcal{N}(a_{r}X_{1},b_{r}^{2}I).\tag{12}$$

Thus, the endpoint-anchor stochastic flow map is given by

$$\tilde{\psi}_{t\to r}^{\mathrm{end}}(X_{t})=\tilde{X}_{r}=a_{r}X_{1}+b_{r}\xi=a_{r}\,\psi_{r\to 1}\circ\psi_{t\to r}(X_{t})+b_{r}\xi,\qquad\xi\sim\mathcal{N}(0,I).\tag{13}$$

This gives a simple stochasticization of $\psi_{t\to r}$ through the endpoint anchor. Compared with the local anchor, the endpoint anchor resamples from the endpoint conditional distribution $p_{r|1}$ rather than from a short local interval, and therefore typically injects stronger randomness. However, this choice relies on the accuracy of the long-range map $\psi_{r\to 1}$. This requirement may be restrictive in practice, especially for two-time flow-map models such as MeanFlow, which parameterize general transitions $\psi_{t\to r}$ but are not specifically optimized as dedicated endpoint maps $\psi_{t\to 1}$.
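As a hedged sketch of Equation 13, the snippet below resamples a state at time $r$ from the endpoint conditional $\mathcal{N}(a_{r}X_{1},b_{r}^{2}I)$. The linear path $a_{r}=r$, $b_{r}=1-r$, the function name `endpoint_anchor_step`, and the toy map `point_mass_map` are assumptions for illustration. The toy map corresponds to data concentrated at a single point `m`, for which the exact flow map to the data endpoint returns `m` for any input.

```python
import numpy as np

def endpoint_anchor_step(x_r, r, psi_to_data, xi, a=lambda s: s, b=lambda s: 1.0 - s):
    """Endpoint-anchor resampling (Equation 13): a_r * psi_{r->1}(x_r) + b_r * xi."""
    x1_hat = psi_to_data(x_r, r)        # anchor state at tau = 1
    return a(r) * x1_hat + b(r) * xi    # one draw from N(a_r * x1_hat, b_r^2 I)

def point_mass_map(x, r, m=3.0):
    """Toy endpoint map for data concentrated at m (illustrative assumption)."""
    return np.full_like(x, m)

rng = np.random.default_rng(0)
x_r = np.array([2.0, -1.0])
x_tilde = endpoint_anchor_step(x_r, 0.25, point_mass_map, rng.standard_normal(2))
```

In this toy case the resampled state is distributed as $\mathcal{N}(a_{r}m,b_{r}^{2}I)$, which is exactly the marginal $p_{r}$ of the affine path, so the marginal is preserved. With a learned map, the quality of the resampled state depends on how accurate $\psi_{r\to 1}$ is.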

![Refer to caption](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_106_grid.jpg)

Figure 5: sCM post-training with PickScore reward. Qualitative comparison between the base sCM generator and the Flow-Map GRPO post-trained sCM model on the prompt “Raw Photo, masterpiece award winning close up of a massive timber wolf standing in the moonlight in the dark in the darkness.” The model is post-trained using PickScore reward with $K=4$ stochastic flow-map steps during training, and evaluated under multiple inference step budgets to examine cross-step generalization.

#### 3.2.3 Intermediate Anchor

More generally, we may choose an intermediate anchor time $r<\tau<1$. A natural but incorrect idea is to mimic the endpoint anchor and use the affine path to construct a closed-form transition. Since $X_{\tau}=a_{\tau}X_{1}+b_{\tau}X_{0}$ and $X_{r}=a_{r}X_{1}+b_{r}X_{0}$, one may sample a fresh noise variable $\xi\sim\mathcal{N}(0,I)$, estimate $X_{1}$ by $\hat{X}_{1}=\frac{X_{\tau}-b_{\tau}\xi}{a_{\tau}}$, and then form

$$\tilde{\psi}_{t\to r}(X_{t})=\tilde{X}_{r}^{\mathrm{naive}}=a_{r}\hat{X}_{1}+b_{r}\xi=\frac{a_{r}}{a_{\tau}}X_{\tau}+\Big(b_{r}-\frac{a_{r}b_{\tau}}{a_{\tau}}\Big)\xi,\qquad\xi\sim\mathcal{N}(0,I).\tag{14}$$

However, this approximation treats the noise variable $X_{0}$ as an unconditional Gaussian even after conditioning on $X_{\tau}$. Indeed, the exact conditional distribution should sample $X_{0}$ from its posterior given $X_{\tau}$: $p_{0|\tau}(x_{0}|x_{\tau})\propto p_{0}(x_{0})\,p_{1}\big(\frac{x_{\tau}-b_{\tau}x_{0}}{a_{\tau}}\big)$, up to the constant Jacobian factor. Therefore, the exact conditional transition is

$$X_{r}|X_{\tau}=\frac{a_{r}}{a_{\tau}}X_{\tau}+\Big(b_{r}-\frac{a_{r}b_{\tau}}{a_{\tau}}\Big)X_{0},\qquad X_{0}\sim p_{0|\tau}(\cdot|X_{\tau}).\tag{15}$$

Equivalently, $p_{r|\tau}(x_{r}|x_{\tau})=\int\delta\big(x_{r}-\frac{a_{r}}{a_{\tau}}x_{\tau}-(b_{r}-\frac{a_{r}b_{\tau}}{a_{\tau}})x_{0}\big)\,p_{0|\tau}(x_{0}|x_{\tau})\,dx_{0}$. The naive Gaussian formula above replaces $p_{0|\tau}(\cdot|X_{\tau})$ with the unconditional prior $p_{0}=\mathcal{N}(0,I)$, thereby ignoring the posterior constraint imposed by the anchor state. This missing posterior correction can be substantial for intermediate anchors, because $X_{\tau}$ contains information about both the data variable $X_{1}$ and the noise variable $X_{0}$. In our experiments, this approximation leads to large errors and unstable RL post-training.
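The mismatch of the naive formula in Equation 14 can be checked in closed form on a toy problem. The sketch below assumes one-dimensional data $X_{1}\sim\mathcal{N}(0,s^{2})$ and the linear path $a_{r}=r$, $b_{r}=1-r$; both are illustrative assumptions, as are the function names. It compares the variance of the true marginal $p_{r}$ with the variance produced when a fresh $\xi$ independent of $X_{\tau}$ is used.

```python
def exact_marginal_var(r, s2=1.0):
    """Variance of X_r = a_r X_1 + b_r X_0 with X_1 ~ N(0, s2), X_0 ~ N(0, 1)."""
    a_r, b_r = r, 1.0 - r
    return a_r**2 * s2 + b_r**2

def naive_marginal_var(r, tau, s2=1.0):
    """Variance of (a_r / a_tau) X_tau + (b_r - a_r b_tau / a_tau) xi, with xi independent of X_tau."""
    a_r, b_r = r, 1.0 - r
    a_t, b_t = tau, 1.0 - tau
    var_tau = a_t**2 * s2 + b_t**2          # variance of the anchor state X_tau
    c = b_r - a_r * b_t / a_t               # coefficient on the fresh noise
    return (a_r / a_t) ** 2 * var_tau + c**2

exact = exact_marginal_var(0.3)             # 0.09 + 0.49 = 0.58
naive = naive_marginal_var(0.3, 0.6)        # 0.13 + 0.25 = 0.38
```

With $r=0.3$ and $\tau=0.6$ the naive construction underestimates the marginal variance (0.38 instead of 0.58), so it does not preserve $p_{r}$ in this toy setting. At $\tau=1$ the two variances coincide, consistent with the endpoint anchor being exact for the affine path.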

### 3.3 Flow-Map GRPO

With ASFMC, a deterministic flow map is converted into a stochastic policy while preserving the original probability path. For a flow-map step from $t_{j}$ to $r_{j}$, we define the state as $s_{j}=(c,t_{j},r_{j},x_{t_{j}})$ and the action as the next latent state $a_{j}=x_{r_{j}}$. The policy is induced by the stochastic flow map: $\pi_{\theta}(a_{j}|s_{j})=p_{\theta}(x_{r_{j}}|x_{t_{j}},c)=p_{\theta}\big(a_{j}=\tilde{\psi}_{t_{j}\to r_{j}}^{\theta}(x_{t_{j}}|c)\,\big|\,s_{j}\big)$. Given the sampled action $a_{j}=x_{r_{j}}$, the MDP transition is deterministic, $P(s_{j+1}|s_{j},a_{j})=\delta_{(c,r_{j},r_{j+1},a_{j})}(s_{j+1})$, where the next state uses the sampled latent $a_{j}$ as its input state. Therefore, ASFMC turns flow-map sampling into a stochastic multi-step MDP as in [subsection 2.2](https://arxiv.org/html/2607.00535#S2.SS2), enabling policy-gradient post-training. In this work, we instantiate the RL objective with GRPO, while noting that ASFMC is a stochasticization mechanism and can be combined with other RL algorithms.
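The MDP structure above can be sketched as a short rollout loop. This is an illustrative skeleton under stated assumptions: `rollout`, `stochastic_step`, and `final_map` are hypothetical names, the two callables stand in for an ASFMC-randomized flow map and the deterministic last segment, and nothing here reproduces the authors' code.

```python
import numpy as np

def rollout(x0, cond, grid, stochastic_step, final_map, rng):
    """Run the stochastic steps on grid r_0..r_K, then one deterministic step to the data endpoint.

    Returns the list of (state, action) pairs and the final sample x_1.
    """
    x, transitions = x0, []
    for j in range(1, len(grid)):
        state = (cond, grid[j - 1], grid[j], x)   # s_j = (c, r_{j-1}, r_j, x_{r_{j-1}})
        action = stochastic_step(state, rng)      # a_j = x_{r_j}, sampled from the policy
        transitions.append((state, action))
        x = action                                # deterministic transition: next input is a_j
    return transitions, final_map(x, grid[-1], cond)

def toy_step(state, rng):
    """Placeholder policy step: small Gaussian perturbation of the current latent."""
    _, _, _, x = state
    return x + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
grid = np.array([0.0, 0.3, 0.6, 0.8])
steps, x1 = rollout(np.zeros(2), "a prompt", grid, toy_step, lambda x, r, c: x, rng)
```

Only the stored `(state, action)` pairs enter the policy ratio; the final deterministic map contributes the reward sample alone.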

Flow-map models can be broadly divided into two categories. Single-time flow maps learn a fixed endpoint map $\psi_{t\to 1}$, while two-time flow maps directly learn transitions $\psi_{t\to r}$ between arbitrary time pairs. We summarize these parameterizations in [subsection A.2](https://arxiv.org/html/2607.00535#A1.SS2). We next discuss how to apply Flow-Map GRPO to these two settings.

#### 3.3.1 Two-Time Flow Maps

Two-time flow maps can be naturally combined with both the local anchor and the endpoint anchor. We take MeanFlow (Geng et al., [2025a](https://arxiv.org/html/2607.00535#bib.bib76)) as an example, which parameterizes the transition by an average velocity $u^{\theta}_{t\to r}$: $\psi^{\theta}_{t\to r}(x_{t}|c)=x_{t}+(r-t)\,u^{\theta}_{t\to r}(x_{t}|c)$. We split the interval $[0,1]$ into $K+1$ segments. The last segment $[r_{\mathrm{end}},1]$ is kept deterministic and is not optimized during RL post-training, similar to Flow-GRPO. The remaining stochastic transitions are defined on $[0,r_{\mathrm{end}}]$. Specifically, we first define uniform reference points $\hat{r}_{i}=\frac{i}{K}r_{\mathrm{end}}$ for $i=0,\ldots,K$, and then optionally apply an exponential time shift $r_{i}=r_{\mathrm{end}}\frac{\exp(\gamma i/K)-1}{\exp(\gamma)-1}$, $i=0,\ldots,K$, where $\gamma$ controls the concentration of time points.
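The time grid described above can be written directly from the two formulas. The sketch below is a minimal illustration; the function name `time_grid` is hypothetical, and treating $\gamma=0$ as the uniform grid is an assumption based on the limit $\frac{\exp(\gamma x)-1}{\exp(\gamma)-1}\to x$ as $\gamma\to 0$.

```python
import numpy as np

def time_grid(K, r_end, gamma=0.0):
    """Time points r_0..r_K on [0, r_end]: uniform, or exponentially shifted when gamma != 0."""
    i = np.arange(K + 1)
    if abs(gamma) < 1e-8:
        return r_end * i / K                      # uniform reference points
    return r_end * (np.exp(gamma * i / K) - 1.0) / (np.exp(gamma) - 1.0)

uniform = time_grid(4, 0.8)                       # equally spaced points on [0, 0.8]
shifted = time_grid(4, 0.8, gamma=2.0)            # for gamma > 0, points cluster near r = 0
```

Both grids start at 0 and end at $r_{\mathrm{end}}$; a positive $\gamma$ makes the early steps shorter and the later steps longer, while a negative $\gamma$ does the opposite.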

For each stochastic segment $r_{j-1}\to r_{j}$, we instantiate the policy density using the ASFMC-randomized flow map $a_{j}=x_{r_{j}}\sim\pi_{\theta}(\cdot\mid s_{j})=\tilde{\psi}^{\theta}_{r_{j-1}\to r_{j}}(x_{r_{j-1}}|c)$, which can be computed by either the local-anchor update in [Equation 10](https://arxiv.org/html/2607.00535#S3.E10) or the endpoint-anchor update in [Equation 13](https://arxiv.org/html/2607.00535#S3.E13):

$$\begin{aligned}
a_{j}^{\mathrm{loc}}&=\hat{x}_{r_{j}}-\Delta r\,\lambda^{2}\Big[\frac{\dot{a}_{\tau}}{a_{\tau}}\hat{x}_{r_{j}}-u^{\theta}_{r\to r}(\hat{x}_{r_{j}})\Big]+\sigma_{\tau}\sqrt{\Delta r}\,\xi,\\
a_{j}^{\mathrm{end}}&=a_{r}\,\psi^{\theta}_{r\to 1}(\hat{x}_{r_{j}})+b_{r}\xi,
\end{aligned}\tag{16}$$

where $\xi\sim\mathcal{N}(0,I)$ and $\hat{x}_{r_{j}}=x_{r_{j-1}}+(r_{j}-r_{j-1})\,u^{\theta}_{r_{j-1}\to r_{j}}(x_{r_{j-1}}|c)$. Here $r$ abbreviates $r_{j}$, and $\tau=r+\Delta r$ in the local-anchor update. After the $K$ stochastic policy steps, the final deterministic segment maps $x_{r_{\mathrm{end}}}$ to the data endpoint $x_{1}=\psi^{\theta}_{r_{\mathrm{end}}\to 1}(x_{r_{\mathrm{end}}}|c)$, and the reward is evaluated only at the final sample, $R(x_{1}|c)$. Given a prompt $c$, we sample a group of $G$ trajectories $\{\tau^{i}\}_{i=1}^{G}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and compute the group-normalized advantage

$$\hat{A}^{i}=\frac{R(x^{i}_{1}|c)-\operatorname{mean}\big(\{R(x^{j}_{1}|c)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{R(x^{j}_{1}|c)\}_{j=1}^{G}\big)}.$$

We optimize the flow-map policy with the GRPO objective

$$\begin{aligned}
\mathcal{J}_{\text{FM-GRPO}}(\theta)=\mathbb{E}_{c,\{\tau^{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}\Big[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{K}\sum_{j=1}^{K}\Big(&\min\big(\rho^{i}_{j}(\theta)\hat{A}^{i},\operatorname{clip}(\rho^{i}_{j}(\theta),1-\epsilon,1+\epsilon)\hat{A}^{i}\big)\\
&-\beta D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid s^{i}_{j})\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s^{i}_{j})\big)\Big)\Big],
\end{aligned}\tag{17}$$

where $s^{i}_{j}=(c,r_{j-1},r_{j},x^{i}_{r_{j-1}})$ and $\rho^{i}_{j}(\theta)=\frac{\pi_{\theta}(x^{i}_{r_{j}}\mid s^{i}_{j})}{\pi_{\theta_{\mathrm{old}}}(x^{i}_{r_{j}}\mid s^{i}_{j})}$. The final deterministic segment $r_{\mathrm{end}}\to 1$ is used only to obtain the reward sample and is excluded from the policy-ratio and KL terms.
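The advantage normalization and the clipped term of Equation 17 can be sketched with NumPy. This is a simplified illustration: it omits the KL penalty, uses the population standard deviation with a small `eps` for numerical safety (both assumptions, since the text does not specify them), and the names `group_advantages` and `clipped_surrogate` are hypothetical.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (R - mean) / std over the G samples of one prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped GRPO term averaged over G trajectories and K steps (KL penalty omitted).

    logp_new, logp_old : arrays of shape (G, K) with per-step log-densities
    advantages         : array of shape (G,), shared across the K steps
    """
    ratio = np.exp(logp_new - logp_old)                       # rho_j^i(theta)
    adv = advantages[:, None]                                 # broadcast over steps
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()

adv = group_advantages([1.0, 2.0, 3.0])
logp = np.zeros((3, 4))
value_at_old_policy = clipped_surrogate(logp, logp, adv)      # ratios are all 1 here
```

When the current and old policies coincide, every ratio equals 1 and the surrogate reduces to the mean advantage, which is zero by construction. The gradient at that point is still informative because each trajectory's ratio is weighted by its own advantage.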

#### 3.3.2 One-Time Flow Maps

One-time (single-time) flow maps learn a fixed endpoint map $\psi^{\theta}_{t\to 1}$ rather than arbitrary transitions $\psi^{\theta}_{t\to r}$. We take sCM as an example, where the model directly predicts the endpoint state $X_{1}=\psi^{\theta}_{t\to 1}(X_{t}|c)$. Following the sCM time parameterization, we uniformly split the angular interval $[0,\pi/2]$ into $K+1$ segments, and use the corresponding cosine time points $\{r_{j}\}_{j=0}^{K+1}$. The first $K$ segments are used as stochastic policy steps for RL post-training, while the last segment $[r_{K},r_{K+1}]$ is kept deterministic and used only to obtain the final sample. The main difference from two-time flow maps is that one-time flow maps cannot directly use the local anchor, since the local anchor requires a short transition $\psi_{r\to r+\Delta r}$ or the instantaneous velocity $u_{r}$, neither of which is provided by a fixed-endpoint model.

Fortunately, the endpoint anchor is directly compatible with one-time flow maps. For each stochastic segment $r_{j-1}\to r_{j}$, we first predict the endpoint state $X_{1}=\psi^{\theta}_{r_{j-1}\to 1}(X_{r_{j-1}}|c)$, and then sample the next latent state from the affine conditional path:

$$a_{j}^{\mathrm{end}}=a_{r_{j}}\,\psi^{\theta}_{r_{j-1}\to 1}(x_{r_{j-1}}|c)+b_{r_{j}}\xi,\qquad\xi\sim\mathcal{N}(0,I).\tag{18}$$

Thus, the policy density is $p_{\theta}(x_{r_{j}}\mid x_{r_{j-1}},c)=\mathcal{N}\big(a_{r_{j}}\psi^{\theta}_{r_{j-1}\to 1}(x_{r_{j-1}}|c),\,b_{r_{j}}^{2}I\big)$. Given the sampled action $a_{j}=x_{r_{j}}$, the MDP transition remains deterministic as before: $P(s_{j+1}\mid s_{j},a_{j})=\delta_{(c,r_{j},r_{j+1},a_{j})}(s_{j+1})$. After the $K$ stochastic endpoint-anchor steps, we use the deterministic endpoint map to obtain the final sample $x_{1}=\psi^{\theta}_{r_{\mathrm{end}}\to 1}(x_{r_{\mathrm{end}}}|c)$, with $r_{\mathrm{end}}=r_{K}$, and evaluate the reward $R(x_{1}|c)$. The GRPO objective is the same as in the two-time case, with the policy ratio computed using the one-time endpoint-anchor policy density above.
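Because the endpoint-anchor policy is an isotropic Gaussian, its log-density and the KL term have simple closed forms. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the predicted endpoints are passed in as arrays in place of model outputs, and the KL expression is the standard result for two Gaussians sharing the covariance $b_{r}^{2}I$.

```python
import numpy as np

def endpoint_policy_logp(x_next, x1_pred, a_r, b_r):
    """Log-density of N(a_r * x1_pred, b_r^2 I) evaluated at x_next."""
    d = x_next.size
    resid = x_next - a_r * x1_pred
    return -0.5 * np.sum(resid**2) / b_r**2 - d * np.log(b_r) - 0.5 * d * np.log(2.0 * np.pi)

def endpoint_policy_kl(x1_pred, x1_ref, a_r, b_r):
    """KL(N(a_r x1_pred, b_r^2 I) || N(a_r x1_ref, b_r^2 I)) for a shared covariance."""
    return 0.5 * a_r**2 * np.sum((x1_pred - x1_ref) ** 2) / b_r**2

x_next = np.array([0.4, -0.1])
pred_new, pred_old = np.array([1.0, 0.0]), np.array([0.9, 0.1])
log_ratio = (endpoint_policy_logp(x_next, pred_new, 0.5, 0.5)
             - endpoint_policy_logp(x_next, pred_old, 0.5, 0.5))   # log rho for one step
```

The normalizing constants cancel in the log-ratio, so only the squared residuals scaled by $b_{r}^{2}$ matter. A small $b_{r_{j}}$ therefore makes the ratio very sensitive to changes in the predicted endpoint.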

The detailed algorithm for [subsubsection 3.3.1](https://arxiv.org/html/2607.00535#S3.SS3.SSS1) and [subsubsection 3.3.2](https://arxiv.org/html/2607.00535#S3.SS3.SSS2) is given in Algorithm 1 ([subsection 4.1](https://arxiv.org/html/2607.00535#S4.SS1)). This algorithm enables GRPO post-training of flow-map-based few-step generators and can be combined with existing GRPO variants such as MixGRPO. During inference, the sampling procedure follows the standard one-time and two-time flow-map samplers; we summarize these two sampling parameterizations in [subsection A.2](https://arxiv.org/html/2607.00535#A1.SS2).

## 4 Experiment

### 4.1 Text-to-Image Generation

Algorithm 1: Flow-Map GRPO with ASFMC

1: Input: prompt dataset $\mathcal{D}$, group size $G$, stochastic steps $K$, time grid $\{r_{j}\}_{j=0}^{K}$ with $r_{K}=r_{\mathrm{end}}$, policy $\pi_{\theta}$, reference policy $\pi_{\mathrm{ref}}$, reward model $R$, clip range $\epsilon$, KL weight $\beta$, learning rate $\eta$

2: repeat

3: Sample prompt $c\sim\mathcal{D}$ and set $\theta_{\mathrm{old}}\leftarrow\theta$

4: for $i=1,\ldots,G$ do

5: Sample initial latent $x^{i}_{r_{0}}\sim p_{0}$

6: for $j=1,\ldots,K$ do

7: $s^{i}_{j}\leftarrow(c,r_{j-1},r_{j},x^{i}_{r_{j-1}})$

8: if $\mathrm{type}=\mathrm{two\text{-}time}$ then

9: Compute $\hat{x}^{i}_{r_{j}}\leftarrow\psi^{\theta_{\mathrm{old}}}_{r_{j-1}\to r_{j}}(x^{i}_{r_{j-1}}|c)$

10: if $\alpha=\mathrm{loc}$ then

11: Sample $a^{i}_{j}$ with local or endpoint anchor using eq. [16](https://arxiv.org/html/2607.00535#S3.E16)

12: end if

13: else

14: Compute $x^{i}_{1}\leftarrow\psi^{\theta_{\mathrm{old}}}_{r_{j-1}\to 1}(x^{i}_{r_{j-1}}|c)$

15: Sample $a^{i}_{j}$ using anchor eq. [18](https://arxiv.org/html/2607.00535#S3.E18)

16: end if

17: Update $s^{i}_{j+1}\leftarrow(c,r_{j},r_{j+1},x^{i}_{r_{j}})$

18: end for

19: Compute $x^{i}_{1}\leftarrow\psi^{\theta_{\mathrm{old}}}_{r_{\mathrm{end}}\to 1}(x^{i}_{r_{\mathrm{end}}}|c)$

20: Compute reward $R^{i}\leftarrow R(x^{i}_{1}|c)$

21: end for

22: Compute $\mathcal{J}(\theta)$ using eq. [17](https://arxiv.org/html/2607.00535#S3.E17).

23: $\theta\leftarrow\theta+\eta\nabla_{\theta}\mathcal{J}(\theta)$

24: until convergence

We first evaluate Flow-Map GRPO for text-to-image post-training using the official T2I-Distill checkpoints released by Pu et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib75)), which are distilled from FLUX.1-lite and include both MeanFlow and sCM few-step generators. Starting from these pretrained flow-map models, we freeze the base generator and train only LoRA adapters. For each backbone, we perform reward-specific post-training with PickScore, OCR accuracy, and GenEval as the final-image reward, respectively. Each reward defines a separate training run and produces a separate LoRA adapter; across these runs, we keep the resolution, guidance scale, optimizer, LoRA architecture, rollout group size, and ASFMC stochasticization hyperparameters fixed. Thus, the only changes across the PickScore, OCR, and GenEval runs are the reward function and the corresponding prompt set. Details are shown in [subsection C.1](https://arxiv.org/html/2607.00535#A3.SS1).

During RL post-training, all models use $K=4$ stochastic ASFMC transitions. The reward is computed only from the final decoded image, while the likelihood-ratio and KL terms are evaluated on the $K$ stochastic policy transitions. The final deterministic endpoint transition is used only to obtain the reward sample and is excluded from the policy-ratio and KL terms. At evaluation time, we follow the standard deterministic sampling procedure of each flow-map parameterization, as summarized in [subsection A.2](https://arxiv.org/html/2607.00535#A1.SS2).

We report task-level, perceptual, and preference-based metrics, including GenEval, OCR accuracy, PickScore, aesthetic score, DQA, ImageReward, and UniReward, following Liu et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib150)). Since MeanFlow and sCM use different flow-map parameterizations and native sampling schedules, we present their results in separate tables: MeanFlow results are shown in [Table 1](https://arxiv.org/html/2607.00535#S4.T1), [Table 2](https://arxiv.org/html/2607.00535#S4.T2), and [Table 3](https://arxiv.org/html/2607.00535#S4.T3), while sCM results are shown in [Table 4](https://arxiv.org/html/2607.00535#S4.T4), [Table 5](https://arxiv.org/html/2607.00535#S4.T5), and [Table 6](https://arxiv.org/html/2607.00535#S4.T6). Within each table, results are grouped by the number of inference steps. Flow-Map GRPO LoRA post-training is performed with $K=4$, while evaluation is reported across multiple sampling steps to assess generalization across inference step counts. For each sampling budget, the pretrained base checkpoint and its Flow-Map GRPO LoRA counterpart are evaluated with the same sampler, inference schedule, guidance scale, resolution, and random seed. This isolates the effect of RL post-training within each flow-map parameterization. Additional qualitative comparisons are provided in [Appendix D](https://arxiv.org/html/2607.00535#A4).

Table 1: Text-to-image post-training results for the MeanFlow OCR checkpoint from T2I-Distill under uniform sampling steps. OCR Acc. is the primary optimization target and denotes OCR evaluation on the OCR/text-rendering task. DrawBench is used as a general text-to-image prompt benchmark to evaluate whether OCR-oriented post-training preserves broader image-quality, alignment, and reward metrics. $\uparrow$ indicates higher is better.

Table 2: Text-to-image post-training results for MeanFlow checkpoints from T2I-Distill under uniform sampling steps. PickScore Task is the primary optimization target and denotes the standalone PickScore evaluation. DrawBench is used as a general text-to-image prompt benchmark to evaluate whether PickScore-oriented post-training transfers to broader image-quality, alignment, and reward metrics. $\uparrow$ indicates higher is better.

Table 3: Text-to-image post-training results for the MeanFlow GenEval checkpoint from T2I-Distill under uniform sampling steps. GenEval is the primary optimization target and denotes the official full GenEval evaluation. DrawBench is used as a general text-to-image prompt benchmark to evaluate whether GenEval-oriented post-training preserves broader image-quality, alignment, and reward metrics. $\uparrow$ indicates higher is better.

Table 4: Text-to-image post-training results for the sCM OCR checkpoint from T2I-Distill under uniform sampling steps. OCR Acc. is the primary optimization target and denotes OCR evaluation on the OCR/text-rendering task. DrawBench is used as a general text-to-image prompt benchmark to evaluate whether OCR-oriented post-training preserves broader image-quality, alignment, and reward metrics. $\uparrow$ indicates higher is better.

Table 5: Text-to-image post-training results for sCM checkpoints from T2I-Distill under uniform sampling steps. PickScore Task is the primary optimization target and denotes the standalone PickScore evaluation. DrawBench is used as a general text-to-image prompt benchmark to evaluate whether PickScore-oriented post-training transfers to broader image-quality, alignment, and reward metrics. $\uparrow$ indicates higher is better.

Table 6: Text-to-image post-training results for the sCM GenEval checkpoint from T2I-Distill under uniform sampling steps. GenEval is the primary optimization target and denotes the full GenEval evaluation. DrawBench is used as a general text-to-image prompt benchmark to evaluate whether GenEval-oriented post-training preserves broader image-quality, alignment, and reward metrics. $\uparrow$ indicates higher is better.

## 5 Related Work

##### Continuous\-time generative models\.

Diffusion models and continuous-time flow-based models have become a central framework for high-quality generative modeling. Diffusion models define a stochastic noising process and learn to reverse it through denoising or score estimation (Ho et al., [2020](https://arxiv.org/html/2607.00535#bib.bib82); Song et al., [2020a](https://arxiv.org/html/2607.00535#bib.bib79); [b](https://arxiv.org/html/2607.00535#bib.bib80)). In parallel, flow-based formulations such as flow matching and rectified flow learn a time-dependent velocity field that transports a simple prior distribution to the data distribution through a probability-flow ODE (Lipman et al., [2022](https://arxiv.org/html/2607.00535#bib.bib78); Liu et al., [2023](https://arxiv.org/html/2607.00535#bib.bib98)). Stochastic interpolants further provide a broad framework that connects deterministic flows and stochastic diffusions through probability paths (Albergo et al., [2025](https://arxiv.org/html/2607.00535#bib.bib136)). These continuous-time models provide a principled foundation for generative sampling, but typically require many numerical integration steps at inference time.

##### Few\-step flow\-map generation\.

To reduce sampling cost, recent methods learn long-range transport operators, or flow maps, that directly map samples between time points. Consistency models learn mappings from noisy states to clean data and enable one-step or few-step generation through self-consistency constraints (Song et al., [2023](https://arxiv.org/html/2607.00535#bib.bib102); Song and Dhariwal, [2023](https://arxiv.org/html/2607.00535#bib.bib103); Geng et al., [2024](https://arxiv.org/html/2607.00535#bib.bib86)). Flow Map Matching provides a mathematical framework for learning two-time flow maps of an underlying probability-flow dynamics and connects flow maps with consistency models and progressive distillation (Boffi et al., [2025](https://arxiv.org/html/2607.00535#bib.bib100)). Shortcut models and MeanFlow further parameterize long-range transitions or average velocities, allowing direct jumps over arbitrary time intervals and enabling efficient few-step generation (Frans et al., [2025](https://arxiv.org/html/2607.00535#bib.bib97); Geng et al., [2025a](https://arxiv.org/html/2607.00535#bib.bib76); [b](https://arxiv.org/html/2607.00535#bib.bib77)). These methods achieve fast inference by replacing iterative ODE integration with deterministic long-range mappings. However, because the learned transition $\psi_{t\rightarrow r}$ is deterministic, these models do not naturally define stochastic trajectories or transition likelihoods, which makes direct reinforcement learning post-training nontrivial.

##### Reinforcement learning for generative models\.

Reinforcement learning has recently been used to align generative models with task-level rewards, including human preference, aesthetic quality, and downstream evaluation metrics. DDPO formulates diffusion sampling as a multi-step Markov decision process and applies policy-gradient optimization to stochastic denoising trajectories (Black et al., [2024](https://arxiv.org/html/2607.00535#bib.bib152)). Subsequent methods extend similar ideas to visual generation and flow-based models (Xue et al., [2025](https://arxiv.org/html/2607.00535#bib.bib153); Liu et al., [2026](https://arxiv.org/html/2607.00535#bib.bib150); Li et al., [2025](https://arxiv.org/html/2607.00535#bib.bib151)). In particular, Flow-GRPO introduces stochasticity into velocity-based flow models by replacing the deterministic probability-flow ODE with a path-preserving SDE, thereby obtaining stochastic transitions and likelihood ratios for GRPO optimization. Relatedly, GLASS Flows (Holderrieth et al., [2026b](https://arxiv.org/html/2607.00535#bib.bib137)) improve stochastic transition sampling for inference-time alignment of continuous-time flow and diffusion models. The main difference from our setting is that these methods target diffusion or velocity-based flow models, where generation is represented through infinitesimal or finely discretized transitions. In contrast, few-step flow-map models directly parameterize long-range transports $\psi_{t\rightarrow r}$. For such transitions, the corresponding SDE-induced transition generally depends on the stochastic path between $t$ and $r$, and therefore cannot be treated as a simple local Gaussian perturbation of the deterministic flow map. As a result, stochasticization techniques developed for velocity-based samplers are not directly applicable to deterministic flow-map generators without an additional path-preserving construction.

##### Stochastic and reward\-aware flow maps\.

Several recent works have explored stochastic variants of flow maps for posterior sampling and reward alignment\. Meta Flow Maps \(MFMs\)\(Potaptchiket al\.,[2026](https://arxiv.org/html/2607.00535#bib.bib139)\)extend consistency models and flow maps into the stochastic regime by training a model to perform one\-step posterior sampling, producing multiple samples fromp\(x1∣xt\)p\(x\_\{1\}\\mid x\_\{t\}\)given an intermediate noisy state\. Diamond Maps\(Holderriethet al\.,[2026a](https://arxiv.org/html/2607.00535#bib.bib138)\)propose stochastic flow\-map models designed for efficient reward alignment, aiming to make adaptability to arbitrary rewards an intrinsic property of the generative model\. Strong Stochastic Flow Maps\(McCallumet al\.,[2026](https://arxiv.org/html/2607.00535#bib.bib140)\)directly generalize deterministic flow maps to the stochastic setting by learning strong solution maps of additive\-noise SDEs, while stochastic few\-step models\(Passaroet al\.,[2026](https://arxiv.org/html/2607.00535#bib.bib142)\)study efficient few\-step sampling from SDE\-defined conditional distributions\. Relatedly, Flow Map Reward Guidance\(Huanget al\.,[2026](https://arxiv.org/html/2607.00535#bib.bib141)\)uses flow maps for training\-free reward guidance from an optimal\-control perspective\.

These methods are closely related in that they also recognize the importance of stochasticity or reward-aware control for efficient alignment. However, they address a different regime from ours. Existing stochastic flow-map methods typically train or redesign the generative model to be stochastic by construction, for example by learning a native stochastic transition kernel, posterior sampler, or stochastic solution map. In contrast, our goal is to start from a pretrained deterministic flow map and introduce stochasticity only as a post-hoc transformation for RL post-training. Our ASFMC construction does not change the deterministic flow-map parameterization. Instead, it composes the deterministic flow map with anchor-based conditional resampling, yielding a path-preserving stochastic policy suitable for likelihood-ratio-based GRPO optimization.

##### Positioning\.

Our work is complementary to native stochastic flow\-map learning\. Rather than training a new stochastic flow\-map model class, we provide a stochasticization mechanism for existing deterministic few\-step generators\. This distinction is important for practical post\-training: ASFMC allows pretrained consistency\-style or MeanFlow\-style models to be converted into stochastic policies without modifying their original training objective or model parameterization\.

## 6 Conclusion

We presented Flow-Map GRPO, a reinforcement learning post-training framework for deterministic few-step flow-map generators. Our starting point is the observation that existing SDE-based stochasticization methods, although effective for velocity-based diffusion and flow models, are intrinsically local and do not directly extend to long-range flow-map transitions. To overcome this limitation, we introduced Anchored Stochastic Flow Map Composition (ASFMC), which converts a deterministic flow map into a stochastic policy by combining deterministic transport with anchor-based conditional resampling. This construction preserves the original marginal probability path while providing the trajectory-level stochasticity needed for policy-gradient optimization. Building on ASFMC, we formulated few-step flow-map sampling as a stochastic MDP and derived a GRPO-style objective that applies to both single-time and two-time flow-map parameterizations. Empirically, we validated our method on few-step FLUX-based text-to-image generators, showing that RL post-training can improve pretrained MeanFlow and sCM checkpoints across reward-based, perceptual, and task-level metrics.

Our work suggests that deterministic few\-step generators can benefit from the same reward\-driven post\-training paradigm that has proven effective for diffusion and velocity\-based flow models, provided that stochasticity is introduced in a way that respects the underlying probability path\. We hope this opens a path toward broader RL alignment of fast generative models, including more general flow\-map architectures and other few\-step image, video, and multimodal generators\.

## References

- M. Albergo, N. M. Boffi, and E. Vanden-Eijnden (2025) Stochastic interpolants: a unifying framework for flows and diffusions. Journal of Machine Learning Research 26(209), pp. 1–80.
- K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine (2024) Training diffusion models with reinforcement learning. In International Conference on Learning Representations, Vol. 2024, pp. 4965–4987.
- N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden (2025) Flow map matching with stochastic interpolants: a mathematical framework for consistency models. Transactions on Machine Learning Research (TMLR).
- Q. Dao, H. Phung, B. Nguyen, and A. Tran (2023) Flow matching in latent space. arXiv preprint arXiv:2307.08698.
- K. Frans, D. Hafner, S. Levine, and P. Abbeel (2025) One step diffusion via shortcut models. In International Conference on Learning Representations (ICLR).
- Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He (2025a) Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447.
- Z. Geng, Y. Lu, Z. Wu, E. Shechtman, J. Z. Kolter, and K. He (2025b) Improved mean flows: on the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012.
- Z. Geng, A. Pokle, W. Luo, J. Lin, and J. Z. Kolter (2024) Consistency models made easy. arXiv preprint arXiv:2406.14548.
- Y. Guo, W. Wang, Z. Yuan, R. Cao, K. Chen, Z. Chen, Y. Huo, Y. Zhang, Y. Wang, S. Liu, et al. (2025) Splitmeanflow: interval splitting consistency in few-step generative modeling. arXiv preprint arXiv:2507.16884.
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in neural information processing systems 33, pp. 6840–6851.
- J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet (2022) Video diffusion models. Advances in neural information processing systems 35, pp. 8633–8646.
- P. Holderrieth, D. Chen, L. Eyring, I. Shah, G. Anantharaman, Y. He, Z. Akata, T. Jaakkola, N. M. Boffi, and M. Simchowitz (2026a) Diamond maps: efficient reward alignment via stochastic flow maps. arXiv preprint arXiv:2602.05993.
- P. Holderrieth, U. Singer, T. Jaakkola, R. T. Chen, Y. Lipman, and B. Karrer (2026b) GLASS flows: efficient inference for reward alignment of flow and diffusion models. In The Fourteenth International Conference on Learning Representations.
- J. Y. Huang, J. Lin, S. Shah, K. Nair, and N. M. Boffi (2026) How to guide your flow: few-step alignment via flow map reward guidance. arXiv preprint arXiv:2604.27147.
- J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, Y. Cheng, M. Yang, Z. Zhong, and L. Bo (2025) Mixgrpo: unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802.
- Z. Li, Y. Sun, D. Chen, J. He, and B. Zhu (2026) Trajectory consistency for one-step generation on euler mean flows. arXiv preprint arXiv:2602.02571.
- Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
- Y. Lipman, M. Havasi, P. Holderrieth, N. Shaul, M. Le, B. Karrer, R. T. Chen, D. Lopez-Paz, H. Ben-Hamu, and I. Gat (2024) Flow matching guide and code. arXiv preprint arXiv:2412.06264.
- J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang (2026) Flow-grpo: training flow matching models via online rl. Advances in neural information processing systems 38, pp. 40783–40818.
- X. Liu, C. Gong, and Q. Liu (2023) Flow straight and fast: learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR).
- S. McCallum, Z. W. Blasingame, T. Herschell, N. Rindtorff, A. Tong, and J. Foster (2026) Strong stochastic flow maps. arXiv preprint arXiv:2606.01086.
- R. Passaro, Z. W. Blasingame, M. M. Bronstein, and A. Tong (2026) Stochastic few-step models.
- P. Potaptchik, A. Saravanan, A. Mammadov, A. Prat, M. S. Albergo, and Y. W. Teh (2026) Meta flow maps enable scalable reward alignment. arXiv preprint arXiv:2601.14430.
- Y. Pu, Y. Han, Z. Tang, J. Tang, F. Wang, B. Zhuang, and G. Huang (2025) Few-step distillation for text-to-image generation: a practical guide. arXiv preprint arXiv:2512.13006.
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- J. Song, C. Meng, and S. Ermon (2020a) Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
- Y. Song, P. Dhariwal, M. Chen, and I. Sutskever (2023) Consistency models. In International Conference on Machine Learning (ICML).
- Y. Song and P. Dhariwal (2023) Improved techniques for training consistency models. In International Conference on Learning Representations (ICLR).
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020b) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
- Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, et al. (2025) Dancegrpo: unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818.

## Appendix

## Appendix A Flow-Map-Based Few-Step Method

### A.1 Flow Map Learning

As discussed in [subsection 2.1](https://arxiv.org/html/2607.00535#S2.SS1), flow-map learning aims to directly learn the long-range transport map $\psi_{t\to r}$ associated with an underlying probability flow. Once such a map is available, one-step generation can be performed by $x_1=\psi^{\theta}_{0\to 1}(x_0)$, $x_0\sim p_0$. This perspective differs from conventional multi-step generative modeling, which learns an instantaneous velocity $u_t$ or score field $\nabla\log p_t$ and then numerically integrates the corresponding dynamics. Flow-map learning instead attempts to amortize this integration into a mapping that performs a long-range transition in one or a few network evaluations.

A central difficulty is that there is usually no easily computable closed-form target flow map $\psi_{t\to r}(x_t)$ for direct supervision. Therefore, one cannot simply train the model with a direct regression objective of the form $\|\psi^{\theta}_{t\to r}(x_t)-\psi_{t\to r}(x_t)\|_2^2$. In multi-step methods such as Flow Matching, this issue is circumvented by constructing a conditional probability path $p_t(x_t|x_1)$ and using an associated instantaneous quantity, such as the conditional velocity $u_t(x_t|x_1)$, to regress the marginal velocity:

$$\mathcal{L}_{\mathrm{FM}}(\theta)=\mathbb{E}_{t,\,x_1\sim p_1,\,x_t\sim p_{t|1}(\cdot|x_1)}\left[\|u^{\theta}(x_t,t)-u_t(x_t|x_1)\|_2^2\right]. \tag{19}$$

The same strategy, however, does not directly extend to flow maps. Unlike instantaneous velocity fields, long-range flow maps do not admit a self-consistent conditional counterpart.

###### Theorem A.1 (Non-existence of conditional flow maps; Theorem 4.1 of Li et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib74))).

There exists no conditional flow map $\psi_{t\to r}(x|x_1)$ that simultaneously (1) is consistent with the conditional velocity $u_t(x|x_1)$ under the flow-map evolution equation in [Equation 1](https://arxiv.org/html/2607.00535#S2.E1); and (2) satisfies the marginal consistency relation $\psi_{t\to r}(x)=\mathbb{E}_{x_1\sim p_t(x_1|x)}[\psi_{t\to r}(x|x_1)]$. Consequently, a self-consistent conditional cumulative field does not exist.

Because such a conditional flow map does not exist, flow-map learning is typically formulated through surrogate objectives based on trajectory consistency. In particular, the semigroup property of the exact flow map implies

$$\psi_{t\to r}(x_t)=\psi_{s\to r}(\psi_{t\to s}(x_t)),\qquad t,s,r\in[0,1], \tag{20}$$

which motivates the trajectory-consistency loss $\mathcal{L}^{\mathrm{TC}}(\theta)=\mathbb{E}_{t,s,r,\,x_t\sim p_t}\left[\|\psi^{\theta}_{t\to r}(x_t)-\psi^{\theta}_{s\to r}(\psi^{\theta}_{t\to s}(x_t))\|_2^2\right]$. However, this loss alone does not provide dataset-level supervision: all terms are generated by the model itself. Therefore, $\mathcal{L}^{\mathrm{TC}}$ can enforce internal consistency of the learned maps, but it cannot by itself guarantee that $\psi^{\theta}_{0\to 1}$ pushes $p_0$ to the desired target distribution $p_1$.
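
The limitation of $\mathcal{L}^{\mathrm{TC}}$ can be illustrated with a minimal sketch. It assumes a scalar toy ODE $\dot{x}=-x$, whose exact flow map is $\psi_{t\to r}(x)=x e^{-(r-t)}$, and it draws $x_t$ from a standard Gaussian as a stand-in for $p_t$. The helper names `exact_map`, `identity_map`, and `tc_loss` are hypothetical and not part of the paper. Both the exact map and the trivial identity map reach (numerically) zero trajectory-consistency loss, so the loss cannot by itself identify the map that transports $p_0$ to $p_1$.

```python
import math
import random

def exact_map(x, t, r):
    """Exact flow map of the toy ODE dx/ds = -x (assumed for illustration)."""
    return x * math.exp(-(r - t))

def identity_map(x, t, r):
    """A degenerate map that ignores the dynamics entirely."""
    return x

def tc_loss(psi, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[(psi_{t->r}(x) - psi_{s->r}(psi_{t->s}(x)))^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw three times in [0, 1] and order them as t <= s <= r.
        t, s, r = sorted(rng.random() for _ in range(3))
        x = rng.gauss(0.0, 1.0)  # stand-in for a sample from p_t
        direct = psi(x, t, r)
        composed = psi(psi(x, t, s), s, r)
        total += (direct - composed) ** 2
    return total / n_samples

loss_exact = tc_loss(exact_map)
loss_identity = tc_loss(identity_map)
print(loss_exact, loss_identity)
```

The identity map is internally consistent yet transports nothing, which is why practical methods add data-dependent or teacher-based supervision.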

Existing flow-map-based generative models can therefore be broadly organized into two categories according to how they obtain supervision for long-range maps: progressive distillation methods and derivative-based methods Li et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib74)). The former progressively transfers supervision from short-range transitions to longer-range maps, while the latter derives differential identities that connect long-range maps to instantaneous velocity or score fields. We first summarize these two families and then discuss two representative derivative-based models used in our experiments: sCM and MeanFlow.

##### Progressive distillation methods\.

Progressive distillation methods learn short-range transitions first and then extend them to longer intervals through consistency relations or teacher-student distillation. Representative examples include ShortCut Frans et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib97)), the PSD variant of Flow Map Matching Boffi et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib100)), and SplitMeanFlow Guo et al. ([2025](https://arxiv.org/html/2607.00535#bib.bib143)). These methods usually start from a short-step map that is close to an instantaneous velocity model and can therefore be supervised using data-dependent conditional velocities. For example, ShortCut first learns a short-step velocity through

$$\mathcal{L}_{\mathrm{short}}(\theta)=\mathbb{E}_{x_1\sim p_1,\,x\sim p_{t|1}(\cdot|x_1)}\left[\|u^{\theta}_{t\to t+d}(x)-u_t(x|x_1)\|_2^2\right],\qquad u_t(x|x_1)=\frac{x_1-x}{1-t}. \tag{21}$$

It then recursively extends the model to longer intervals by enforcing a consistency relation such as

$$\mathcal{L}_{\mathrm{prog}}(\theta)=\mathbb{E}_{x_t\sim p_t}\left[\left\|u^{\theta}_{t\to t+2d}(x_t)-\frac{1}{2}\left(u^{\theta}_{t\to t+d}(x_t)+u^{\theta}_{t+d\to t+2d}(x_{t+d})\right)\right\|_2^2\right], \tag{22}$$

where $x_{t+d}=x_t+d\,u^{\theta}_{t\to t+d}(x_t)$. In this way, long-range maps are learned recursively from shorter model predictions rather than being directly supervised by data. This makes the training objective relatively simple and avoids high-order derivatives, but errors in short-step maps may be propagated or amplified when constructing longer jumps.
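
As a sanity check on the consistency relation in Equation 22, the following sketch evaluates the two-half-steps target for a case where the exact average velocity is known in closed form. It assumes the scalar toy ODE $\dot{x}=-x$, for which $u_{t\to t+d}(x)=x(e^{-d}-1)/d$; the function names are hypothetical. For this exact average velocity, the residual between the $2d$-step prediction and the target vanishes up to floating-point error.

```python
import math

def avg_velocity(x, t, d):
    """Exact average velocity over [t, t+d] for the toy ODE dx/ds = -x (assumed)."""
    return x * (math.exp(-d) - 1.0) / d

def shortcut_target(u, x_t, t, d):
    """Target of Equation 22: the mean of two consecutive d-step velocities."""
    u_first = u(x_t, t, d)
    x_next = x_t + d * u_first            # x_{t+d} = x_t + d * u_{t->t+d}(x_t)
    u_second = u(x_next, t + d, d)
    return 0.5 * (u_first + u_second)

x_t, t, d = 1.7, 0.2, 0.25
residual = avg_velocity(x_t, t, 2 * d) - shortcut_target(avg_velocity, x_t, t, d)
print(residual)
```

With a learned model in place of `avg_velocity`, the same target would be built from the model's own short-step predictions, which is where short-step errors can propagate into longer jumps.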

##### Derivative\-based methods\.

Derivative-based methods supervise long-range flow maps by relating them to instantaneous velocity or score fields through differential identities implied by trajectory consistency. For an exact flow map, one has the Lagrangian transport equation

$$\frac{d}{dr}\psi_{t\to r}(x)=u_r(\psi_{t\to r}(x)), \tag{23}$$

and, equivalently, the Eulerian transport equation

$$\partial_t\psi_{t\to r}(x)+u_t(x)\cdot\nabla_x\psi_{t\to r}(x)=0. \tag{24}$$

These identities make it possible to train flow-map models using data-dependent conditional velocities or teacher velocity fields. Compared with progressive distillation, derivative-based methods provide a more direct learning signal for non-infinitesimal transitions. However, they often require derivatives of the learned map with respect to time or input, such as Jacobian-vector products (JVPs), which can increase memory usage, computational cost, and optimization instability. Two representative derivative-based flow-map methods are sCM and MeanFlow.
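
Both transport identities can be checked numerically on a case with a closed-form flow map. The sketch below assumes the scalar, time-dependent toy velocity $u_t(x)=-t\,x$, whose exact flow map is $\psi_{t\to r}(x)=x\exp(-(r^2-t^2)/2)$, and uses central finite differences in place of automatic differentiation. All names are hypothetical.

```python
import math

def u(x, t):
    """Toy velocity field u_t(x) = -t * x (assumed for illustration)."""
    return -t * x

def psi(x, t, r):
    """Exact flow map of dx/ds = -s * x from time t to time r."""
    return x * math.exp(-(r * r - t * t) / 2.0)

def central_diff(f, z, eps=1e-5):
    """Second-order central finite difference of a scalar function."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

x, t, r = 0.8, 0.3, 0.9

# Lagrangian form (Equation 23): d/dr psi_{t->r}(x) = u_r(psi_{t->r}(x)).
lagrangian_residual = central_diff(lambda rr: psi(x, t, rr), r) - u(psi(x, t, r), r)

# Eulerian form (Equation 24): d/dt psi_{t->r}(x) + u_t(x) * d/dx psi_{t->r}(x) = 0.
dpsi_dt = central_diff(lambda tt: psi(x, tt, r), t)
dpsi_dx = central_diff(lambda xx: psi(xx, t, r), x)  # scalar Jacobian
eulerian_residual = dpsi_dt + u(x, t) * dpsi_dx

print(lagrangian_residual, eulerian_residual)
```

In higher dimensions the product $u_t(x)\cdot\nabla_x\psi_{t\to r}(x)$ is a Jacobian-vector product, which is the source of the extra cost mentioned above.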

##### sCM\.

Consistency Models can be viewed as endpoint flow-map models. Instead of learning a general two-time transition $\psi_{t\to r}$, they learn the endpoint map $\psi^{\theta}_{t\to 1}(x_t)$, which maps a point at time $t$ directly to the data endpoint. The defining consistency condition states that points on the same probability-flow trajectory should be mapped to the same endpoint. That is, for $x_s=\psi_{t\to s}(x_t)$, $\psi^{\theta}_{t\to 1}(x_t)=\psi^{\theta}_{s\to 1}(x_s)$. Continuous-time consistency models (sCM) train this endpoint map by applying the Eulerian transport identity ([Equation 24](https://arxiv.org/html/2607.00535#A1.E24)) above to the special case $r=1$. In practice, this gives a derivative-based objective of the form

$$\mathcal{L}_{\mathrm{sCM}}(\theta)=\mathbb{E}_{t,x_t}\left[\lambda(t)\left\|\partial_t\psi^{\theta}_{t\to 1}(x_t)+u_t(x_t)\cdot\nabla_x\psi^{\theta}_{t\to 1}(x_t)\right\|_2^2\right], \tag{25}$$

where $u_t$ denotes the velocity target: $u_t=u_t^{\phi}$ in the distillation setting, and $u_t=u_t(x_t|x_1)$ when training from scratch. The term $u_t(x_t)\cdot\nabla_x\psi^{\theta}_{t\to 1}(x_t)$ is computed through a Jacobian-vector product. Thus, sCM belongs to the derivative-based family: it learns an endpoint flow map using the differential transport constraint inherited from trajectory consistency.

##### MeanFlow\.

MeanFlow Geng et al. ([2025a](https://arxiv.org/html/2607.00535#bib.bib76)) is another representative derivative-based flow-map method. Unlike sCM, which learns an endpoint map $\psi_{t\to 1}$, MeanFlow directly parameterizes two-time flow maps through an average velocity over a finite interval: $u_{t\to r}(x_t)=\frac{\psi_{t\to r}(x_t)-x_t}{r-t}$. Thus, a single evaluation of $u^{\theta}_{0\to 1}$ gives a one-step generator $x_1=x_0+u^{\theta}_{0\to 1}(x_0)$, $x_0\sim p_0$. Substituting $\psi_{t\to r}(x)=x+(r-t)u_{t\to r}(x)$ into the Eulerian transport equation ([Equation 24](https://arxiv.org/html/2607.00535#A1.E24)) yields

$$u_{t\to r}(x)=u_t(x)+(r-t)\left(\partial_t u_{t\to r}(x)+u_t(x)\cdot\nabla_x u_{t\to r}(x)\right). \tag{26}$$

Under the rectified-flow path, when training from scratch, the instantaneous velocity target $u_t$ is given by the data-dependent conditional velocity $u_t(x_t|x_1)=\frac{x_1-x_t}{1-t}$, whereas in the distillation setting, this target is replaced by the teacher velocity field $u_t^{\phi}(x_t)$. Therefore, MeanFlow trains $u^{\theta}_{t\to r}$ by substituting the corresponding instantaneous velocity target $u_t$ into the above identity, yielding the regression objective

$$\mathcal{L}_{\mathrm{MF}}(\theta)=\mathbb{E}_{t,r,x_t}\left[\left\|u^{\theta}_{t\to r}(x_t)-\mathrm{sg}\left[u_t(x_t)+(r-t)\left(\partial_t u^{\theta}_{t\to r}(x_t)+u_t(x_t)\cdot\nabla_x u^{\theta}_{t\to r}(x_t)\right)\right]\right\|_2^2\right], \tag{27}$$

where $\mathrm{sg}$ denotes the stop-gradient operator.
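
The MeanFlow identity in Equation 26 can be verified on a closed-form example. The sketch below assumes the scalar toy velocity $u_t(x)=-t\,x$, for which the exact average velocity is $u_{t\to r}(x)=x\left(\exp(-(r^2-t^2)/2)-1\right)/(r-t)$. Central finite differences stand in for the JVP that an actual implementation would use, and all function names are hypothetical.

```python
import math

def u_inst(x, t):
    """Toy instantaneous velocity u_t(x) = -t * x (assumed for illustration)."""
    return -t * x

def u_avg(x, t, r):
    """Exact average velocity (psi_{t->r}(x) - x) / (r - t) for dx/ds = -s * x."""
    return x * (math.exp(-(r * r - t * t) / 2.0) - 1.0) / (r - t)

def central_diff(f, z, eps=1e-5):
    """Second-order central finite difference of a scalar function."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

def meanflow_target(u_avg_fn, x, t, r):
    """Right-hand side of Equation 26; during training it sits inside a stop-gradient."""
    du_dt = central_diff(lambda tt: u_avg_fn(x, tt, r), t)
    du_dx = central_diff(lambda xx: u_avg_fn(xx, t, r), x)
    return u_inst(x, t) + (r - t) * (du_dt + u_inst(x, t) * du_dx)

x, t, r = 0.8, 0.3, 0.9
residual = u_avg(x, t, r) - meanflow_target(u_avg, x, t, r)
print(residual)
```

For the exact average velocity the residual is zero up to discretization error; for a learned $u^{\theta}_{t\to r}$ the squared residual is what Equation 27 penalizes.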

### A.2 One-Time and Two-Time Flow-Map Parameterizations

The above discussion also highlights an important distinction between one\-time and two\-time flow\-map parameterizations\. Although both are flow\-map\-based few\-step generators, they differ in what map is learned during training and how intermediate sampling steps are constructed at inference time\.

**Algorithm 2** Sampling with Flow Maps

- 1: **Input:** condition $c$, sampling steps $N$, time grid $\{r_j\}_{j=0}^{N}$ with $r_0=0$ and $r_N=1$, initial latent distribution $p_0$, flow-map model $\psi^{\theta}$, path coefficients $(a_t,b_t)$
- 2: Sample initial latent $x_{r_0}\sim p_0$
- 3: **for** $j=1,\ldots,N$ **do**
  - 4: **if** $\mathrm{type}=\mathrm{two\text{-}time}$ **then**
    - 5: Compute $x_{r_j}\leftarrow\psi^{\theta}_{r_{j-1}\to r_j}(x_{r_{j-1}}|c)$
  - 6: **else**
    - 7: Predict the endpoint $\hat{x}_1\leftarrow\psi^{\theta}_{r_{j-1}\to 1}(x_{r_{j-1}}|c)$
    - 8: **if** $r_j<1$ **then**
      - 9: Sample $\xi_j$ according to the sampler
      - 10: Calculate $x_{r_j}\leftarrow a_{r_j}\hat{x}_1+b_{r_j}\xi_j$
    - 11: **else**
      - 12: Set final output $x_{r_j}\leftarrow\hat{x}_1$
    - 13: **end if**
  - 14: **end if**
- 15: **end for**
- 16: **return** final sample $x_{r_N}$
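
The sampling loop of Algorithm 2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function `sample_with_flow_map`, the toy models, and the rectified-path coefficients $(a_r,b_r)=(r,1-r)$ are hypothetical, and $\xi_j$ is assumed to be standard Gaussian, since the algorithm leaves the sampler unspecified.

```python
import math
import random

def sample_with_flow_map(psi, grid, x0, kind, path_coeffs=None, rng=None, cond=None):
    """Few-step sampling loop following Algorithm 2.

    psi(x, t, r, cond) returns the model prediction of the state at time r.
    path_coeffs(r) returns (a_r, b_r) for the affine path x_r = a_r * x_1 + b_r * xi.
    """
    x = list(x0)
    for j in range(1, len(grid)):
        r_prev, r_next = grid[j - 1], grid[j]
        if kind == "two-time":
            # Query the learned two-time map directly.
            x = psi(x, r_prev, r_next, cond)
        else:
            # One-time model: predict the endpoint, then re-noise to level r_next.
            x1_hat = psi(x, r_prev, 1.0, cond)
            if r_next < 1.0:
                a, b = path_coeffs(r_next)
                xi = [rng.gauss(0.0, 1.0) for _ in x1_hat]  # assumed Gaussian sampler
                x = [a * xh + b * n for xh, n in zip(x1_hat, xi)]
            else:
                x = x1_hat
    return x

def toy_two_time(x, t, r, cond):
    """Exact flow map of dx/ds = -x, standing in for a trained two-time model."""
    return [v * math.exp(-(r - t)) for v in x]

def toy_endpoint(x, t, r, cond):
    """A constant endpoint predictor, standing in for a trained one-time model."""
    return [2.0, -1.0]

def rectified_path(r):
    """Assumed affine path coefficients (a_r, b_r) = (r, 1 - r)."""
    return r, 1.0 - r

grid = [0.0, 0.5, 1.0]
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(2)]
two_time_sample = sample_with_flow_map(toy_two_time, grid, x0, "two-time")
one_time_sample = sample_with_flow_map(toy_endpoint, grid, x0, "one-time", rectified_path, rng)
print(two_time_sample, one_time_sample)
```

In the two-time branch the intermediate state is produced by the model itself, whereas in the one-time branch it is rebuilt from the endpoint prediction and fresh noise, which is the distinction discussed in the following paragraphs.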

##### One\-time endpoint maps\.

One-time flow-map models, such as sCM, learn only the endpoint map $\psi^{\theta}_{t\to 1}(x_t)$, which maps a state at time $t$ directly to the data endpoint. During training, the model is therefore supervised through endpoint consistency or through the derivative-based transport constraint obtained by setting $r=1$ in [Equation 24](https://arxiv.org/html/2607.00535#A1.E24). In other words, the model is not trained to represent arbitrary transitions $\psi_{t\to r}$ for $r<1$. At sampling time, this means that an intermediate transition from $t$ to $r$ cannot be obtained by directly querying a learned map $\psi^{\theta}_{t\to r}$. Instead, the model first predicts the endpoint $\hat{x}_1=\psi^{\theta}_{t\to 1}(x_t)$, and then constructs an intermediate state using the prescribed probability path. For an affine path, this has the form $x_r=a_r\hat{x}_1+b_r\xi$, $\xi\sim\mathcal{N}(0,I)$, or its deterministic counterpart under a fixed sampling rule. Thus, one-time models perform few-step generation by repeatedly predicting the endpoint and projecting back to intermediate noise levels according to the path schedule. This makes endpoint-map models simple and efficient, but less flexible for arbitrary finite-time transitions.

##### Two\-time flow maps\.

Two-time flow-map models, such as MeanFlow, directly parameterize transitions between arbitrary time pairs, $\psi^{\theta}_{t\to r}(x_t)$, $0\leq t,r\leq 1$. For example, MeanFlow represents this transition through the average velocity $\psi^{\theta}_{t\to r}(x_t)=x_t+(r-t)u^{\theta}_{t\to r}(x_t)$. During training, the model is supervised over sampled time pairs $(t,r)$ using the derivative identity derived from the flow-map transport equation. At sampling time, a finite-time transition is obtained directly by evaluating the learned two-time map: $x_r=\psi^{\theta}_{t\to r}(x_t)$. Therefore, two-time models naturally support arbitrary sampling grids and can take long-range jumps between any selected time points.

##### Implications for Flow\-Map GRPO\.

This distinction determines how ASFMC is instantiated during RL post-training. For two-time flow maps, the model already provides access to $\psi^{\theta}_{t\to r}$ for arbitrary time pairs, so both local anchors and endpoint anchors can be used to stochasticize a finite-time transition. In contrast, for one-time endpoint maps, the model only provides $\psi^{\theta}_{t\to 1}$ and does not directly provide short local transitions such as $\psi^{\theta}_{r\to r+\Delta r}$. Therefore, the local-anchor construction is not directly available for one-time models, and we use the endpoint anchor to define the stochastic policy transition.

## Appendix B Missing Proofs

### B.1 Proof of [Theorem 3.2](https://arxiv.org/html/2607.00535#S3.Thmtheorem2)

##### Theorem [3.2](https://arxiv.org/html/2607.00535#S3.Thmtheorem2)

(Long-range SDE transitions require stochastic integration) Let $X_s$ solve the path-preserving SDE [Equation 5](https://arxiv.org/html/2607.00535#S2.E5) and let $\psi_{t\to r}$ be the deterministic flow map induced by the ODE [Equation 1](https://arxiv.org/html/2607.00535#S2.E1). For a non-infinitesimal interval $t<r$, the exact SDE transition is generally not representable as a simple additive Gaussian perturbation of a deterministic function of the flow maps:

$$X_r=G_{t,r}(X_t;\psi,d\psi,\ldots)+A_{t,r}\epsilon,\qquad\epsilon\sim\mathcal{N}(0,I), \tag{28}$$

where $A_{t,r}$ is independent of the input sample and $G_{t,r}(X_t;\psi,d\psi,\ldots)$ denotes a deterministic functional of the flow-map family $\psi$ and its derivatives. Instead, except for special linear-Gaussian dynamics, the exact transition must be expressed by the stochastic integral $X_r=X_t+\int_t^r b_s(X_s)\,ds+\int_t^r\sigma_s\,dW_s$, where $b_s(x)=u_s(x)+\frac{\sigma_s^2}{2}\nabla_x\log p_s(x)$.

###### Proof\.

It suffices to show that the exact SDE transition is generally non\-Gaussian on an arbitrarily small but finite interval\. Indeed, if a representation of the formXr=Gt,r\(Xt;ψ,dψ,…\)\+At,rϵX\_\{r\}=G\_\{t,r\}\(X\_\{t\};\\psi,d\\psi,\\ldots\)\+A\_\{t,r\}\\epsilon,ϵ∼𝒩\(0,I\)\\epsilon\\sim\\mathcal\{N\}\(0,I\), were valid for general finite\-time transitions, then, conditioned onXt=xX\_\{t\}=x, the law ofXrX\_\{r\}would be Gaussian\. In particular, its third centered moment would have to vanish\. We show that this is not true for a generic nonlinear SDE, even for the special caser=t\+hr=t\+hwithh\>0h\>0\.

We prove the obstruction in one dimension; the multidimensional case follows either by considering one\-dimensional systems or by applying the same argument to scalar projections\. Consider the scalar SDE

dXs=bs\(Xs\)ds\+σsdWs,dX\_\{s\}=b\_\{s\}\(X\_\{s\}\)\\,ds\+\\sigma\_\{s\}\\,dW\_\{s\},\(29\)wherebsb\_\{s\}is smooth andσs\>0\\sigma\_\{s\}\>0is state\-independent\. FixXt=xX\_\{t\}=xand setr=t\+hr=t\+h\. The exact solution satisfies

Xt\+h=x\+∫tt\+hbs\(Xs\)𝑑s\+∫tt\+hσs𝑑Ws\.X\_\{t\+h\}=x\+\\int\_\{t\}^\{t\+h\}b\_\{s\}\(X\_\{s\}\)\\,ds\+\\int\_\{t\}^\{t\+h\}\\sigma\_\{s\}\\,dW\_\{s\}\.\(30\)Thus, the transition depends on the full stochastic trajectory\{Xs\}s∈\[t,t\+h\]\\\{X\_\{s\}\\\}\_\{s\\in\[t,t\+h\]\}\.

Let

μh=𝔼\[Xt\+h∣Xt=x\],κ3\(h\)=𝔼\[\(Xt\+h−μh\)3∣Xt=x\]\\mu\_\{h\}=\\mathbb\{E\}\[X\_\{t\+h\}\\mid X\_\{t\}=x\],\\qquad\\kappa\_\{3\}\(h\)=\\mathbb\{E\}\\left\[\(X\_\{t\+h\}\-\\mu\_\{h\}\)^\{3\}\\mid X\_\{t\}=x\\right\]\(31\)be the conditional mean and third centered moment\. A short\-time expansion of the diffusion transition gives

κ3\(h\)=σt4∂xxbt\(x\)h3\+o\(h3\)\.\\kappa\_\{3\}\(h\)=\\sigma\_\{t\}^\{4\}\\,\\partial\_\{xx\}b\_\{t\}\(x\)\\,h^\{3\}\+o\(h^\{3\}\)\.\(32\)For completeness, we outline the calculation\. Freezing the coefficients at timettand using the local infinitesimal generator

$$\mathcal{L}_t f(x) = b_t(x)\,\partial_x f(x) + \frac{\sigma_t^2}{2}\,\partial_{xx} f(x), \tag{33}$$

the diffusion semigroup satisfies

$$\mathbb{E}[f(X_{t+h})\mid X_t = x] = f(x) + h\,\mathcal{L}_t f(x) + \frac{h^2}{2}\,\mathcal{L}_t^2 f(x) + \frac{h^3}{6}\,\mathcal{L}_t^3 f(x) + o(h^3). \tag{34}$$

Applying this expansion to $f(x)=x$, $f(x)=x^2$, and $f(x)=x^3$, and substituting the resulting raw moments into

$$\begin{aligned}
\kappa_3(h) &= \mathbb{E}[X_{t+h}^3\mid X_t = x] - 3\,\mathbb{E}[X_{t+h}\mid X_t = x]\,\mathbb{E}[X_{t+h}^2\mid X_t = x] \\
&\qquad + 2\,\mathbb{E}[X_{t+h}\mid X_t = x]^3,
\end{aligned} \tag{35}$$

all terms up to order $h^2$ cancel, and the leading term is exactly

$$\kappa_3(h) = \sigma_t^4\,\partial_{xx}b_t(x)\,h^3 + o(h^3). \tag{36}$$

The same leading term is obtained for smooth time-inhomogeneous coefficients by the corresponding Itô–Taylor expansion; the displayed generator calculation captures the local spatial-curvature obstruction. Therefore, whenever $\partial_{xx}b_t(x)\neq 0$, the third centered moment is nonzero for all sufficiently small but finite $h>0$. Hence the conditional transition law of $X_{t+h}$ given $X_t = x$ is not Gaussian. This contradicts any representation of the form

$$X_{t+h} = G_{t,t+h}(x;\psi,d\psi,\ldots) + A_{t,t+h}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,I), \tag{37}$$

whose conditional distribution is Gaussian and therefore has zero third centered moment.

Since the interval $t\to t+h$ is a finite, nonzero transition and is a special case of $t\to r$, this already rules out such a simple additive-Gaussian representation for general finite-time SDE-induced flow-map transitions. Thus, for nonlinear drift fields, the exact SDE transition must be described through stochastic integration over the full random path. Only in the special linear-Gaussian case, where the drift is affine in the state and the diffusion is state-independent, does the finite-time SDE transition remain Gaussian. ∎
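The cancellation claimed around Eq. (35) can be checked with exact rational arithmetic. The following is a minimal sketch, not the authors' code: it assumes a time-homogeneous polynomial drift (the cubic used below is purely illustrative), applies the generator of Eq. (33) to $x$, $x^2$, $x^3$, and assembles the $h$-series of $\kappa_3(h)$ through order $h^3$. All function names are hypothetical.

```python
from fractions import Fraction as F

# Polynomials are coefficient lists: p[i] is the coefficient of the i-th power.

def poly_mul(p, q):
    out = [F(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, c in enumerate(q):
            out[i + j] += a * c
    return out

def poly_add(p, q):
    n = max(len(p), len(q))
    pad = lambda v: list(v) + [F(0)] * (n - len(v))
    return [a + c for a, c in zip(pad(p), pad(q))]

def poly_diff(p):
    return [i * c for i, c in enumerate(p)][1:] or [F(0)]

def poly_eval(p, x):
    return sum(c * x**i for i, c in enumerate(p))

def generator(f, drift, sigma2):
    """L f = b f' + (sigma^2 / 2) f'' for polynomials in x (Eq. 33)."""
    d1 = poly_diff(f)
    d2 = poly_diff(d1)
    return poly_add(poly_mul(drift, d1), [sigma2 / 2 * c for c in d2])

def moment_series(f, drift, sigma2, x0, order=3):
    """Coefficients of h^0..h^order in E[f(X_{t+h}) | X_t = x0] (Eq. 34)."""
    coeffs, term, fact = [], f, 1
    for k in range(order + 1):
        if k > 0:
            term = generator(term, drift, sigma2)
            fact *= k
        coeffs.append(poly_eval(term, x0) / fact)
    return coeffs

def kappa3_series(drift, sigma2, x0):
    """Coefficients of h^0..h^3 in the third centered moment (Eq. 35)."""
    m1 = moment_series([F(0), F(1)], drift, sigma2, x0)
    m2 = moment_series([F(0), F(0), F(1)], drift, sigma2, x0)
    m3 = moment_series([F(0), F(0), F(0), F(1)], drift, sigma2, x0)
    m1m2 = poly_mul(m1, m2)                      # product of h-series
    m1_cubed = poly_mul(poly_mul(m1, m1), m1)
    full = poly_add(m3, poly_add([-3 * c for c in m1m2],
                                 [2 * c for c in m1_cubed]))
    return full[:4]  # higher powers are truncation artifacts

# Illustrative drift b(x) = 1/3 + 2x - (5/2)x^2 + x^3, so b''(1/2) = -2.
cubic_drift = [F(1, 3), F(2), F(-5, 2), F(1)]
print(kappa3_series(cubic_drift, sigma2=F(3, 2), x0=F(1, 2)))
```

With $\sigma^2 = 3/2$ and $b''(1/2) = -2$, the $h^3$ coefficient equals $\sigma^4\,b''(x_0) = -9/2$ while the lower-order coefficients vanish, in agreement with Eq. (36). Replacing the drift by an affine one makes all four coefficients zero, which is the linear-Gaussian exception.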

### B.2 Proof of [Theorem 3.4](https://arxiv.org/html/2607.00535#S3.Thmtheorem4)

##### Theorem [3.4](https://arxiv.org/html/2607.00535#S3.Thmtheorem4)

(Error of the local-anchor approximation.) Let $\tilde{X}_r^{\star}$ denote the exact sample obtained by running the reverse-time SDE transition $p_{r|\tau}(\cdot\mid X_\tau)$ from $\tau = r+\Delta r$ to $r$, where $X_r = \psi_{t\to r}(X_t)$ and $X_\tau = \psi_{r\to r+\Delta r}(X_r)$. Let $\tilde{X}_r^{\mathrm{loc}}$ denote the local-anchor approximation

$$\tilde{X}_r^{\mathrm{loc}} = X_r - \Delta r\,\lambda^2\left[\frac{\dot{a}_r}{a_r}X_r - u_r(X_r)\right] + \sigma_r\sqrt{\Delta r}\,\xi,\qquad \xi\sim\mathcal{N}(0,I),$$

where $\sigma_r = \lambda\sqrt{\frac{2b_r(\dot{a}_r b_r - a_r\dot{b}_r)}{a_r}}$. Assume that $u_s(x)$, $a_s$, $b_s$, and $\sigma_s$ are smooth with bounded derivatives in a neighborhood of the trajectory. Then the local-anchor approximation satisfies the strong error bound

$$\|\tilde{X}_r^{\star} - \tilde{X}_r^{\mathrm{loc}}\|_{L^2} = O\big((\Delta r)^{3/2}\big). \tag{38}$$

###### Proof.

Define the reverse\-time drift

$$B_s(x) = u_s(x) - \frac{\sigma_s^2}{2}\nabla_x\log p_s(x). \tag{39}$$

The exact reverse-time SDE transition from $\tau = r+\Delta r$ to $r$ can be written as

$$\tilde{X}_r^{\star} = X_\tau - \int_r^{\tau} B_s(X_s)\,ds + \int_r^{\tau}\sigma_s\,dW_s. \tag{40}$$

By the Euler–Maruyama local expansion, using the same Gaussian increment $\xi\sim\mathcal{N}(0,I)$, we have

$$\tilde{X}_r^{\star} = X_\tau - \Delta r\,B_\tau(X_\tau) + \sigma_\tau\sqrt{\Delta r}\,\xi + R_{\mathrm{sde}},\qquad \|R_{\mathrm{sde}}\|_{L^2} = O\big((\Delta r)^{3/2}\big). \tag{41}$$

For the affine probability path, the score satisfies

$$\nabla_x\log p_s(x) = \frac{a_s u_s(x) - \dot{a}_s x}{b_s(\dot{a}_s b_s - a_s\dot{b}_s)}. \tag{42}$$

Using $\sigma_s^2 = \lambda^2\,\frac{2b_s(\dot{a}_s b_s - a_s\dot{b}_s)}{a_s}$, the reverse-time drift becomes

$$B_s(x) = u_s(x) - \lambda^2\left[u_s(x) - \frac{\dot{a}_s}{a_s}x\right] = (1-\lambda^2)\,u_s(x) + \lambda^2\frac{\dot{a}_s}{a_s}x. \tag{43}$$

Since $\tau = r+\Delta r$ and the coefficients are smooth,

$$B_\tau(X_\tau) = B_r(X_r) + O(\Delta r),\qquad \sigma_\tau = \sigma_r + O(\Delta r). \tag{44}$$

Therefore,

$$\tilde{X}_r^{\star} = X_\tau - \Delta r\,B_r(X_r) + \sigma_r\sqrt{\Delta r}\,\xi + O\big((\Delta r)^{3/2}\big) \tag{45}$$

in $L^2$. For the short deterministic segment $r\to\tau$, discretizing the ODE gives

$$X_\tau = \psi_{r\to r+\Delta r}(X_r) = X_r + \Delta r\,u_r(X_r) + O\big((\Delta r)^2\big). \tag{46}$$

Substituting this expression and the formula for $B_r$ gives

$$\begin{aligned}
\tilde{X}_r^{\star} &= X_r + \Delta r\,u_r(X_r) - \Delta r\left[(1-\lambda^2)\,u_r(X_r) + \lambda^2\frac{\dot{a}_r}{a_r}X_r\right] + \sigma_r\sqrt{\Delta r}\,\xi + O\big((\Delta r)^{3/2}\big) \\
&= X_r - \Delta r\,\lambda^2\left[\frac{\dot{a}_r}{a_r}X_r - u_r(X_r)\right] + \sigma_r\sqrt{\Delta r}\,\xi + O\big((\Delta r)^{3/2}\big).
\end{aligned} \tag{47}$$

The leading terms are exactly $\tilde{X}_r^{\mathrm{loc}}$. Hence

$$\|\tilde{X}_r^{\star} - \tilde{X}_r^{\mathrm{loc}}\|_{L^2} = O\big((\Delta r)^{3/2}\big), \tag{48}$$

which proves the claim. ∎
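The algebra in Eqs. (43), (46), and (47) can be sanity-checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's implementation: states are scalars, the path coefficients are taken to be $a_r = r$, $b_r = 1-r$ (an assumed linear interpolation path), the velocity value `u_r` is an arbitrary number, and all function names are hypothetical. It compares the closed-form local-anchor update with the two-stage route "Euler step $r\to\tau$, then one Euler–Maruyama reverse step using $B_r$".

```python
import math

def sigma_r(lam, a, a_dot, b, b_dot):
    """Noise scale of Theorem 3.4: lam * sqrt(2 b (a' b - a b') / a)."""
    return lam * math.sqrt(2.0 * b * (a_dot * b - a * b_dot) / a)

def local_anchor_step(x_r, u_r, lam, dr, a, a_dot, b, b_dot, xi):
    """Closed-form local-anchor update (last line of Eq. 47, remainder dropped)."""
    sig = sigma_r(lam, a, a_dot, b, b_dot)
    return x_r - dr * lam**2 * (a_dot / a * x_r - u_r) + sig * math.sqrt(dr) * xi

def euler_round_trip(x_r, u_r, lam, dr, a, a_dot, b, b_dot, xi):
    """Forward Euler step (Eq. 46) followed by a reverse step with B_r (Eq. 43)."""
    x_tau = x_r + dr * u_r
    reverse_drift = (1.0 - lam**2) * u_r + lam**2 * a_dot / a * x_r
    sig = sigma_r(lam, a, a_dot, b, b_dot)
    return x_tau - dr * reverse_drift + sig * math.sqrt(dr) * xi

# Assumed linear path evaluated at r = 0.6; all numbers are illustrative.
path = dict(a=0.6, a_dot=1.0, b=0.4, b_dot=-1.0)
loc = local_anchor_step(0.3, -0.8, lam=0.7, dr=0.03, xi=0.5, **path)
ref = euler_round_trip(0.3, -0.8, lam=0.7, dr=0.03, xi=0.5, **path)
print(abs(loc - ref))
```

The two routes agree up to floating-point rounding, and setting `lam=0.0` returns `x_r` unchanged, so the update reduces to the deterministic flow-map output when no stochasticity is injected.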

## Appendix C Experiment Details

### C.1 Text-to-Image Post-Training Details

##### Base checkpoints\.

We use the FLUX.1-lite few-step generators released by T2I-Distill (Pu et al., [2025](https://arxiv.org/html/2607.00535#bib.bib75)). The released checkpoints include both MeanFlow and sCM variants. MeanFlow is treated as a two-time flow-map model, which directly parameterizes transitions $\psi^{\theta}_{t\to r}$, while sCM is treated as a one-time endpoint flow-map model, which predicts $\psi^{\theta}_{t\to 1}$. For each method, the “Base” model denotes the corresponding released T2I-Distill checkpoint without any RL post-training. The post-trained model denotes the same frozen base generator plus a reward-specific LoRA adapter trained with Flow-Map GRPO.

##### Reward\-specific post\-training\.

We train separate LoRA adapters for three reward families:

$$R \in \{\mathrm{PickScore},\ \mathrm{OCR},\ \mathrm{GenEval}\}. \tag{49}$$

For each reward, the model is post-trained on the corresponding prompt set and the reward is evaluated on the final generated image (Liu et al., [2026](https://arxiv.org/html/2607.00535#bib.bib150)). All reward-specific runs share the same training protocol unless otherwise stated. In particular, the pretrained generator is frozen and only the LoRA adapter is updated. This setup allows us to evaluate whether Flow-Map GRPO can improve few-step flow-map generators without full-parameter finetuning.

##### Flow\-Map GRPO rollout\.

During RL post-training, we use $K=4$ stochastic flow-map transitions for all models and all rewards. In implementation, the sampler uses a five-node rollout schedule, where four transitions are valid stochastic policy steps and the final endpoint transition is deterministic. The stochastic transitions are used to compute policy likelihood ratios, while the final deterministic transition is used only to obtain the decoded image for reward evaluation. For a prompt $c$, we sample a group of $G$ trajectories, compute the final-image rewards, normalize the rewards into group-level advantages, and optimize the LoRA policy with the clipped GRPO objective in [Equation 17](https://arxiv.org/html/2607.00535#S3.E17).

##### Common training hyperparameters\.

Unless otherwise specified, the following hyperparameters are shared across PickScore, OCR, and GenEval post\-training runs, and across the MeanFlow and sCM backbones\.

Table 7: Common Flow-Map GRPO post-training hyperparameters for text-to-image experiments.

##### LoRA configuration\.

All post-training runs use the same LoRA architecture. The LoRA rank is set to 64, the LoRA scaling factor is set to 128, and the LoRA dropout is set to 0.0. We apply LoRA to the attention projections and feed-forward projections of the FLUX transformer: `attn.to_q`, `attn.to_k`, `attn.to_v`, `attn.to_out.0`, `attn.add_q_proj`, `attn.add_k_proj`, `attn.add_v_proj`, `attn.to_add_out`, `ff.net.0.proj`, `ff.net.2`, `ff_context.net.0.proj`, `ff_context.net.2`, following Liu et al. ([2026](https://arxiv.org/html/2607.00535#bib.bib150)).
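For reference, the configuration above can be collected into a single mapping. This is a sketch only: the key names (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) follow a common LoRA naming convention and are an assumption, since the text does not state which library or field names are used, and reading the "LoRA scaling factor" as the alpha parameter is likewise an assumption.

```python
# Hypothetical config mapping; key names are assumed, values are from the text.
LORA_CONFIG = {
    "r": 64,               # LoRA rank
    "lora_alpha": 128,     # "LoRA scaling factor" (assumed to mean alpha)
    "lora_dropout": 0.0,
    "target_modules": [
        # attention projections
        "attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out.0",
        "attn.add_q_proj", "attn.add_k_proj", "attn.add_v_proj",
        "attn.to_add_out",
        # feed-forward projections
        "ff.net.0.proj", "ff.net.2",
        "ff_context.net.0.proj", "ff_context.net.2",
    ],
}

# Under the usual alpha / rank convention, the adapter update is scaled by this factor.
effective_scale = LORA_CONFIG["lora_alpha"] / LORA_CONFIG["r"]
print(effective_scale)
```

If the alpha / rank convention applies, the low-rank update is multiplied by 2 before being added to the frozen weights.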

##### Advantage normalization and GRPO objective\.

For each prompt group, rewards are gathered across devices and normalized before the policy update. We use per-prompt reward statistics together with a global batch standard deviation. Concretely, for a trajectory $i$ associated with prompt $c$, the advantage is computed as

$$\hat{A}^{i} = \frac{R(x^{i}_{1}\mid c) - \mu_c}{\sigma_{\mathrm{global}} + \epsilon_{\mathrm{std}}}, \tag{50}$$

where $\mu_c$ is the running per-prompt reward mean, $\sigma_{\mathrm{global}}$ is the reward standard deviation computed over the gathered rollout batch, and $\epsilon_{\mathrm{std}} = 10^{-4}$ is a numerical stabilizer. The policy update recomputes the log probability of each stochastic ASFMC transition and applies the clipped GRPO objective.
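A minimal sketch of Eq. (50) and of a clipped surrogate term is given below. Several details are assumptions because the text here does not fix them: the population standard deviation is used for $\sigma_{\mathrm{global}}$, the running per-prompt means are passed in as a precomputed dictionary, the surrogate follows the standard PPO-style clipped form (Equation 17 is not reproduced in this appendix), and `clip_range=0.2` is a placeholder rather than the paper's setting. Function names are hypothetical.

```python
import math
from statistics import pstdev

def grpo_advantages(rewards, prompt_ids, prompt_means, eps_std=1e-4):
    """Eq. (50): subtract the per-prompt mean, divide by the global batch std."""
    sigma_global = pstdev(rewards)  # assumption: population std over the batch
    return [
        (reward - prompt_means[pid]) / (sigma_global + eps_std)
        for reward, pid in zip(rewards, prompt_ids)
    ]

def clipped_surrogate(logp_new, logp_old, advantage, clip_range=0.2):
    """Standard clipped policy-ratio term for one stochastic transition."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return min(ratio * advantage, clipped_ratio * advantage)

# Toy batch: two prompts with two trajectories each (values are illustrative).
rewards = [1.0, 3.0, 2.0, 6.0]
prompt_ids = ["a", "a", "b", "b"]
prompt_means = {"a": 2.0, "b": 4.0}
advantages = grpo_advantages(rewards, prompt_ids, prompt_means)
print(advantages)
print(clipped_surrogate(math.log(2.0), 0.0, advantages[1]))
```

Because the denominator is shared across the batch, prompts whose rewards spread more widely (prompt `b` here) receive proportionally larger advantages than they would under purely per-prompt standardization.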

##### Stochasticization during training\.

For MeanFlow, which directly parameterizes two-time flow maps, we instantiate ASFMC on each stochastic segment. The rollout uses a post-flow stochastic kernel with stochasticity enabled during training. The flow-map delta is set to 0.03, the terminal base sigma is set to 0.05, and the post-SDE and endpoint noise levels are set to 0.7 during training. The final endpoint segment is deterministic. For sCM, which predicts endpoint maps, we use the endpoint-anchor version of ASFMC described in [Section 3.2.2](https://arxiv.org/html/2607.00535#S3.SS2.SSS2). In both cases, only the stochastic ASFMC transitions contribute to the GRPO likelihood-ratio objective.

##### Evaluation protocol\.

All reported evaluations are deterministic\. We disable the training\-time stochasticity by setting the evaluation post\-SDE and endpoint noise levels to zero\. The base checkpoint and the corresponding LoRA checkpoint are evaluated with identical resolution, guidance scale, random seed, and inference schedule within each model family\. We report task\-level, perceptual, and preference\-based metrics, including GenEval, OCR accuracy, PickScore, aesthetic score, DQA, ImageReward, and UniReward\.

## Appendix D Additional Results

### D.1 MeanFlow - OCR

![Figure 6](https://arxiv.org/html/2607.00535v1/images/result/meanflow_ocr/prompt_046_grid.jpg)

Figure 6: MeanFlow-OCR comparison: A realistic photograph of a street sign with “one way” prominently displayed, set against a backdrop of a busy urban street with cars and pedestrians.

![Figure 7](https://arxiv.org/html/2607.00535v1/images/result/meanflow_ocr/prompt_053_grid.jpg)

Figure 7: MeanFlow-OCR comparison: A weathered treasure map laid out on an old wooden table, with “X Marks the Spot” clearly visible in the center, surrounded by intricate illustrations of mountains, forests, and a distant coastline, all under the warm glow of a vintage lamp.

![Figure 8](https://arxiv.org/html/2607.00535v1/images/result/meanflow_ocr/prompt_059_grid.jpg)

Figure 8: MeanFlow-OCR comparison: In a cozy cat cafe, a menu board displays “Purr Therapy 5 minute” among other offerings. A fluffy gray cat sits nearby, looking relaxed and ready to offer its soothing presence to patrons. The scene is warm and inviting, with soft lighting and comfortable seating.

![Figure 9](https://arxiv.org/html/2607.00535v1/images/result/meanflow_ocr/prompt_086_grid.jpg)

Figure 9: MeanFlow-OCR comparison: A bustling train station with a vintage aesthetic, the platform speaker hanging from a metal pole, clearly announcing “Now Boarding Track 9” amidst the crowd of travelers, luggage carts, and the distant steam of an approaching locomotive.

![Figure 10](https://arxiv.org/html/2607.00535v1/images/result/meanflow_ocr/prompt_091_grid.jpg)

Figure 10: MeanFlow-OCR comparison: A neon-lit nightclub VIP section with a modern, sleek design. The sign prominently displays “List Only” in bold, glowing letters, set against a backdrop of pulsing lights and a crowd enjoying the vibrant nightlife scene.

![Figure 11](https://arxiv.org/html/2607.00535v1/images/result/meanflow_ocr/prompt_108_grid.jpg)

Figure 11: MeanFlow-OCR comparison: A dimly lit, cozy restaurant with a fortune cookie slip prominently displayed, reading “Salmon will betray you”, next to a half-empty plate of sushi. The scene captures the moment of revelation, with a subtle, mysterious atmosphere.

### D.2 MeanFlow - PickScore

![Figure 12](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_023_grid.jpg)

Figure 12: MeanFlow-PickScore comparison: selfie photo of jesus and pope.

![Figure 13](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_038_grid.jpg)

Figure 13: MeanFlow-PickScore comparison: MGb car smashing through hole in the wall, sparks dust rubble bricks, studio lighting.

![Figure 14](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_049_grid.jpg)

Figure 14: MeanFlow-PickScore comparison: Young man with an orange beard, cartoon style.

![Figure 15](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_054_grid.jpg)

Figure 15: MeanFlow-PickScore comparison: gentleman frog in a top hat.

![Figure 16](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_055_grid.jpg)

Figure 16: MeanFlow-PickScore comparison: mickey mouse in black and white film noir gangster style new york.

![Figure 17](https://arxiv.org/html/2607.00535v1/images/result/meanflow_pickscore/prompt_064_grid.jpg)

Figure 17: MeanFlow-PickScore comparison: A portrait of a renaissance prince painted by Raphael.

### D.3 Consistency Model - OCR

![Figure 18](https://arxiv.org/html/2607.00535v1/images/result/scm_ocr/prompt_046_grid.jpg)

Figure 18: sCM-OCR comparison: A realistic photograph of a street sign with “one way” prominently displayed, set against a backdrop of a busy urban street with cars and pedestrians.

![Figure 19](https://arxiv.org/html/2607.00535v1/images/result/scm_ocr/prompt_075_grid.jpg)

Figure 19: sCM-OCR comparison: A close-up shot of a movie theater popcorn bucket, prominently displaying the text “Extra Buttery” in bold, with the bucket filled to the brim with golden, buttery popcorn, set against a dark, cinematic background.

![Figure 20](https://arxiv.org/html/2607.00535v1/images/result/scm_ocr/prompt_087_grid.jpg)

Figure 20: sCM-OCR comparison: A close-up of a futuristic robot chest plate, marked with the inscription “Model T800 Service Unit”, set against a sleek, metallic background, showcasing detailed mechanical components and subtle wear, emphasizing its advanced yet utilitarian design.

![Figure 21](https://arxiv.org/html/2607.00535v1/images/result/scm_ocr/prompt_095_grid.jpg)

Figure 21: sCM-OCR comparison: A stylish, modern tea packaging design for “Organic Chamomile Blend”, featuring a serene yellow and white color scheme with delicate illustrations of chamomile flowers and leaves, set against a clean, minimalist background.

![Figure 22](https://arxiv.org/html/2607.00535v1/images/result/scm_ocr/prompt_161_grid.jpg)

Figure 22: sCM-OCR comparison: A close-up of a calculator screen displaying “Error 404”, set against a blurred background of math books and a desk, with a faint glow around the screen to highlight the error message.

![Figure 23](https://arxiv.org/html/2607.00535v1/images/result/scm_ocr/prompt_181_grid.jpg)

Figure 23: sCM-OCR comparison: A vibrant hot air balloon ascends into a clear blue sky, trailing a banner that reads “Adventure Awaits” in bold, flowing letters. The balloon’s colorful pattern contrasts beautifully against the serene landscape below, inviting viewers to join in the journey.

### D.4 Consistency Model - PickScore

![Figure 24](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_007_grid.jpg)

Figure 24: sCM-PickScore comparison: an epic angel dressed in blue with white wings.

![Figure 25](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_033_grid.jpg)

Figure 25: sCM-PickScore comparison: A photo of a beautiful young woman, cute.

![Figure 26](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_048_grid.jpg)

Figure 26: sCM-PickScore comparison: an overgrown abandoned red barn, covered in vines, sunlight filtering through, a stag deer standing in the entrance, 4k.

![Figure 27](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_077_grid.jpg)

Figure 27: sCM-PickScore comparison: an anthropomorphic piebald wolf, medieval, adventurer, dnd, town, rpg, rustic, fantasy, hd digital art.

![Figure 28](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_097_grid.jpg)

Figure 28: sCM-PickScore comparison: a happy nepalese girl in a village.

![Figure 29](https://arxiv.org/html/2607.00535v1/images/result/scm_pickscore/prompt_149_grid.jpg)

Figure 29: sCM-PickScore comparison: fighting dwarf, old, priest, light, screaming, having shield.