@verityw_: Generalist robot policies learn many useful skills. How can we elicit relevant behaviors when faced with new tasks? We …

X AI KOLs Following Papers

Summary

Introduces Flow Reversal Steering (FRS), a method to refine coarse actions from semantic reasoning into precise robot actions by reversing and re-denoising through a flow-matching generalist policy, improving zero-shot control and enabling policy learning.

Generalist robot policies learn many useful skills. How can we elicit relevant behaviors when faced with new tasks? We introduce Flow Reversal Steering (FRS): a way to refine coarse actions produced by semantic reasoning into similar precise ones! https://t.co/BRdvq0OVg0 1/N https://t.co/ua8lRrgmzM
Original Article
View Cached Full Text

Cached at: 06/12/26, 05:00 PM

Generalist robot policies learn many useful skills. How can we elicit relevant behaviors when faced with new tasks? We introduce Flow Reversal Steering (FRS): a way to refine coarse actions produced by semantic reasoning into similar precise ones! https://t.co/BRdvq0OVg0 1/N https://t.co/ua8lRrgmzM


Improving Robotic Generalist Policies via Flow Reversal Steering

Source: https://flow-reversal-steering.github.io/ FRSZero-ShotDSBCDSRL + FRS## Flow Reversal Steering

1Stanford University,2UC Berkeley*Equal contribution

Abstract

Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy’s rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and proposeFlow Reversal Steering (FRS): a method that takes suboptimal but “reasonable” actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions — showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.

Flow Reversal Steering (FRS)

Flow Reversal Steering: a coarse reference action from a reasoner is reversed through the flow field into latent noise, then denoised into a precise in-distribution action.

FRS passes coarse reference actions through a flow policy in reverse, finding latent noises which map to precise actions from a nearby behavioral mode.

t =1.00

Noise via flow reversalDenoise via flow matching

Generalist flow-matching robot policies learn a rich prior over behaviors. These policies contain many skills needed for novel tasks — provided these appropriate behaviors are elicited. How can we use semantic knowledge to steer generalist policies towards sampling “reasonable” actions for new tasks?

We thus proposeFlow Reversal Steering (FRS): a method for guiding generalist flow policies’ action sampling by finding underlying noises that map to semantically-reasonable behaviors. By passing a coarse reference action — which captures roughly how the robot should move — through the flow policy in reverse, FRS finds the noise which approximately maps to the action. When subsequently denoised,FRS effectively finds a nearby good behavioral mode from the generalist’s prior that is similar to the reference.

Standard Flow Denoising

# Sample action from flow policy (K integration steps)
x ~ N(0, I)                  # Sample noise
dt ← 1 / K
for t in 0 … 1:              # K forward steps, size dt
    x ← x + dt·vθ(x, t)
return x                     # action sample

Flow Reversal Steering

# Steer w/ reference action a_ref (K integration steps)
x ← a_ref
dt ← 1 / K
for t = 1 … 0:               # reverse the flow → noise
    x ← x − dt·vθ(x, t)
for t = 0 … 1:               # denoise → nearby mode
    x ← x + dt·vθ(x, t)
return x                     # refined, in-distribution

In turn, this allows semantic reasoners, like humans or VLMs, to guide the policy towards task-relevant good behaviors. The noises and actions produced by FRS can also be used for policy learning and improvement, especially via noise-space behavioral cloning and reinforcement learning.

Overview of the Flow Reversal Steering pipeline.

Overview of the FRS pipeline.

We demonstrateFRSusing state-of-the-artπ0.5vision-language-action models (VLAs)in both simulatedLIBEROand real-worldDROIDmanipulation tasks. We present three ways to useFRS:

  1. Zero-shotFRS: directly execute the refined actionsFRSelicits from coarse human or VLM guidance, with no additional training.
  2. Diffusion Steering via Behavioral Cloning (DSBC): distill good noises from flow reversal into a noise policy using supervised learning.
  3. Diffusion Steering via Reinforcement Learning (DSRL) +FRS: bootstrap noise-space reinforcement learning withFRSrollouts.

Zero-Shot Flow Reversal Steering

FRSconverts coarse actions into better fine-grained actions from the generalist policy’s prior. The simplest way to useFRSis to directly execute these refined actions without any training. Thezero-shot FRSinference loop involves: (1) querying a reasoner (e.g. human or VLM) for a coarse reference action, (2) passing the reference action through flow reversal to find the corresponding noise, and (3) denoising this noise to get the final action to execute at each step.

We make use of theGemini-ER-1.6 VLMto scalably produce semantically-meaningful coarse directional reference actions for evaluation in the LIBERO simulator.

Zero-shotFRSallows a VLM to guide the generalist policy based on its semantic reasonings, raising performance across LIBERO.

Zero-shot FRS improves over the base VLA across LIBERO tasks.Zero-shotFRSconverts coarse VLM actions into effective robot actions, outperforming the base policy and prior steering methods.Zero-shotFRSoutperforms the base VLA across LIBERO. On hard tasks where the base policy achieves ≤ 2% success,11 of 42tasks gain at least**10%**absolute success rate with zero-shotFRS. Other steering baselines only boost 3 or 4 such tasks in this way, suggesting FRS is better in low-success regimes which are especially hard for generalist improvement.FRSalso outperforms directly executing VLM actions, showing that flow reversalrefinesthe coarse reference actions, rather than simply reconstructing them.

Diffusion Steering via Behavioral Cloning (DSBC)

Flow policies can be steered via small auxiliary noise policies, which output noises which the generalist flow policy maps to good actions. However, finding good noises can be challenging —past works like Diffusion Steering via Reinforcement Learning (DSRL)require trial-and-error and Q-learning to find good noises. In contrast, flow reversal rapidly and scalably identifies good noises when simply given reference actions.

Thus, flow reversal enables training noise-steering policies viasupervised learning, rather than expensive reinforcement learning. We naturally call thisDiffusion Steering via Behavioral Cloning (DSBC).

Real-World (DROID)

Noise policies trained with DSBC can steer the generalist VLA on real-world DROID tasks.

Real-world DSBC results showing 80% success vs 20% base VLA on six DROID tasks.

DSBC improves real-world task performance by training on FRS data.

DSBC trained on 10 successful human-steeredFRSrollouts per task significantly outperforms the base π0.5VLA on six DROID tasks. Standard BC with an equivalent flow policy completely fails in this data regime.

Offline DSBC (Ours) solves the precise tape-hanging task, where the base VLA and standard flow BC fail.

Offline DSBC results on DROID tasks.DSBC can also be trained fully offline from a fixed dataset of teleoperated trajectories.Alternatively, DSBC can also be used with standardofflinerobotic demonstration data, e.g., collected via teleoperation. Flow reversal can augment each frame with the noise that approximately maps to the corresponding action chunk, then the DSBC noise policy can be trained on this augmented dataset. We demonstrate this on a more precise tape hanging task, training on 20 teleoperated demonstrations.

Simulation (LIBERO)

DSBC distills FRS trajectories to match zero-shot FRS performance on LIBERO.DSBC on latent noise actions matches zero-shotFRSand beats standard BC on robot actions.We also evaluate DSBC on LIBERO, where it distills the performance gains of zero-shot VLM FRS on 15 tasks, outperforming standard BC. DSBC is also highly efficient: each noise policy trains in under1 minuteusing ~1 GBof GPU memory, without loading the full VLA.

We hypothesize that when the noise policy enters out-of-distribution states, the VLA maps its outputs back to reasonable in-distribution actions, providing implicit robustness against compounding error.

Diffusion Steering via Reinforcement Learning (DSRL) +FRS

Bootstrapping DSRL withFRStrajectories enables faster learning and higher final success on hard tasks. We propose two simple augmentations forDSRL +FRS: (1) prefilling the replay buffer with zero-shotFRSrollouts and (2) adding a BC auxiliary loss on successful trajectories’ noise actions.

DSRL plus FRS outperforms standard DSRL on LIBERO tasks.

**Left:**DSRL +FRSon 15 tasks with effective zero-shot steering. **Right:**on 10 hard tasks whereFRSand the base policy both perform poorly, using even oneFRSsuccess drastically improves RL.

On 15 LIBERO-90 tasks, it learns faster and reaches higher final success than standard DSRL and PLD-style residual RL. On 10 harder tasks where the base VLA gets ~0% and zero-shotFRSreaches only 8%,even a single successful steered trajectory enables DSRL to learn substantially faster and better. By directing the learner toward semantically-meaningful behaviors early in training, DSRL +FRSimproves performance in the sparse-reward regime where standard generalist RL struggles.

As an example, DSRL +FRSlearns on Task 52:pick up the milk and put it in the basket, whereas standard DSRL completely fails, as the base VLA struggles to solve the task without semantic steering from VLMs.

BibTeX

@article{tang2026frs,
  author  = {Andy Tang and William Chen and Andrew Wagenmaker and Chelsea Finn and Sergey Levine},
  title   = {Improving Robotic Generalist Policies via Flow Reversal Steering},
  year    = {2026},
}

Similar Articles

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Hugging Face Daily Papers

RoboLab is a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies, introducing the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes. It enables scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations to assess true generalization capabilities.

Generalizing from simulation

OpenAI Blog

OpenAI describes challenges with conventional RL on robotics tasks and introduces Hindsight Experience Replay (HER), a new RL algorithm that enables agents to learn from binary rewards by reframing failures as intended outcomes, combined with domain randomization for sim-to-real transfer.