KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
Summary
KVPO introduces an ODE-native online GRPO framework that aligns streaming autoregressive video generators with human preferences using causal-semantic KV cache exploration and a velocity-field surrogate policy, achieving consistent improvements in visual quality and alignment.
Source: https://huggingface.co/papers/2605.14278
Published on May 14
Submitted by kkaka (https://huggingface.co/kkakkkka) on May 19
Abstract
KVPO, an ODE-native online GRPO framework, aligns streaming video generators with human preferences through causal-semantic exploration and a velocity-field surrogate policy based on trajectory velocity energy.
Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
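The abstract does not spell out the routing rule or the TVE objective, so the following is only a minimal sketch of the two generic ingredients it names: sampling several branches by randomly selecting historical KV entries, and scoring those branches with group-relative advantages as in standard GRPO. The function names, the independent keep-probability rule, the choice to always retain the most recent entry, and the placeholder rewards are all assumptions made for illustration; none of them is taken from the paper.

```python
import random
import statistics


def route_kv_history(num_cached, keep_prob, rng):
    """Hypothetical routing rule (an assumption, not the paper's method).

    Keeps each historical KV entry independently with probability keep_prob
    and always keeps the most recent entry so a branch is never empty.
    Returns the sorted indices of the retained cache entries.
    """
    kept = [i for i in range(num_cached - 1) if rng.random() < keep_prob]
    kept.append(num_cached - 1)
    return kept


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of branches.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps guards against division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One group of four branches, each seeing a different subset of 8 cached entries.
branches = [route_kv_history(num_cached=8, keep_prob=0.5, rng=rng) for _ in range(4)]

# Placeholder rewards standing in for a reward model's scores per branch.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for kept, adv in zip(branches, advantages):
    print(f"kept KV indices {kept} -> advantage {adv:+.3f}")
```

In this sketch the branches differ only in which cached entries they condition on, with no noise injected into the sampler itself, which mirrors the abstract's point that variation comes from the KV history rather than from stochastic noise. How the advantages weight a velocity-space objective is left out because the page gives no formula for TVE.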
Similar Articles
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification, achieving state-of-the-art alignment quality with substantial training acceleration.
RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
RAVEN introduces a real-time autoregressive video extrapolation framework with CM-GRPO, a novel reinforcement learning method for consistency model sampling, improving long-horizon generation quality.
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
This paper introduces Forcing-KV, a hybrid KV cache compression strategy for autoregressive video diffusion models that separates attention heads into static and dynamic categories, achieving up to 2.82x speedup at 1080P resolution while maintaining output quality.
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA replaces per-head KV caches in video diffusion models with a shared low-rank latent and decoupled 3D-RoPE positional keys, reducing per-token KV memory by 92.7% and improving throughput by 1.23x on a B200 while maintaining quality on VBench benchmarks.
Speculative Decoding for Autoregressive Video Generation
SDVG adapts speculative decoding to autoregressive video diffusion, using an image-quality router to achieve up to 2.09× speed-up with 95.7% quality retention on MovieGenVideoBench.