Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Summary
Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification, achieving state-of-the-art alignment quality with substantial training acceleration.
Cached at: 05/18/26, 02:23 AM
Source: https://huggingface.co/papers/2605.15980
Abstract
Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification.
Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding-window subsampling of training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
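The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the shape of the time-dependent factor in `hypothetical_time_scale`, and the example prompts and rewards are all assumptions made for illustration. The sketch only shows the general pattern, in which every sample in a prompt's group shares one training timestep, advantages are computed relative to that group, and a time-dependent factor is divided out of the loss weight.

```python
import random
from statistics import mean, pstdev


def iso_temporal_timesteps(prompts, group_size, num_timesteps, rng):
    """Pick one timestep per prompt and reuse it for every sample in that prompt's group."""
    return {p: [rng.randrange(num_timesteps)] * group_size for p in prompts}


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score the rewards within a single prompt's group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def hypothetical_time_scale(t, num_timesteps):
    """Assumed stand-in for a time-dependent gradient factor; the paper's actual form is not given here."""
    sigma = (t + 1) / num_timesteps
    return 1.0 / (sigma * sigma)


def loss_weights(advantage, t, num_timesteps):
    """Return (unrectified, rectified) loss weights for one sample at timestep t."""
    scale = hypothetical_time_scale(t, num_timesteps)
    unrectified = advantage * scale   # magnitude varies strongly with t
    rectified = unrectified / scale   # time-dependent factor divided out
    return unrectified, rectified


rng = random.Random(0)
timesteps = iso_temporal_timesteps(["a cat surfing", "a city at night"], 4, 50, rng)
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])  # made-up rewards
early = loss_weights(advantages[2], 0, 50)
late = loss_weights(advantages[2], 49, 50)
```

In this toy setup the unrectified weights at `t = 0` and `t = 49` differ by several orders of magnitude, while the rectified weights coincide. Because each group shares one timestep, reward differences within the group are not confounded by some samples having been trained at easier or harder timesteps.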
Similar Articles
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO introduces an ODE-native online GRPO framework that aligns streaming autoregressive video generators with human preferences using causal-semantic KV cache exploration and a velocity-field surrogate policy, achieving consistent improvements in visual quality and alignment.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO introduces a stable RL training framework for uniform discrete diffusion models, boosting GenEval accuracy from 69% to 96% and OCR benchmark accuracy from 8% to 57%.
Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO) that improves accuracy in modular AI systems by optimizing language model calls and prompts. It reports an average 11% accuracy improvement across various tasks and provides an open-source implementation in DSPy.
@probablynotaz9: Solo-author ICML paper alert Ever wanted to post-train your diffusion LLM with good old policy gradients, without havin…
This solo-author ICML paper introduces Amortized Group Relative Policy Optimization (AGRPO) to enable effective reinforcement learning post-training for diffusion language models.
F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
F-GRPO proposes a factorized group-relative policy optimization framework that unifies candidate generation and ranking in a single autoregressive LLM, addressing credit assignment issues and improving top-ranked performance across sequential recommendation and multi-hop QA benchmarks.