RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling
Summary
RhymeFlow accelerates diffusion transformers for video generation by decoupling denoising trajectories across frames, using keyframe anchoring and latent trajectory projection to reduce computational overhead while maintaining visual quality.
Cached at: 06/15/26, 09:03 AM
Source: https://huggingface.co/papers/2606.06309
Abstract
RhymeFlow accelerates diffusion transformers for video generation by decoupling denoising trajectories across frames, using keyframe anchoring and latent trajectory projection to maintain visual quality while reducing computational overhead.
Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational cost due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, when keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since the skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
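The decoupled schedule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how keyframes are selected, how the skipping progresses, or what form the projection takes. The function names (build_schedule, dense_fraction, project_latent), the fixed skip stride, and the first-order extrapolation below are all hypothetical choices made only to show the idea of dense updates for keyframes, sparse updates for the rest, and an estimate that fills in skipped states.

```python
from typing import List, Sequence


def build_schedule(num_frames: int, num_steps: int,
                   keyframes: Sequence[int], stride: int) -> List[List[bool]]:
    """Return active[step][frame]: True where a frame gets a real denoising update.

    Keyframes are updated at every step. Non-keyframes are updated only on
    every `stride`-th step and on the final step (an assumed, fixed rule;
    the paper's actual skipping policy is not given in the abstract).
    """
    key = set(keyframes)
    schedule = []
    for step in range(num_steps):
        on_grid = step % stride == 0 or step == num_steps - 1
        schedule.append([f in key or on_grid for f in range(num_frames)])
    return schedule


def dense_fraction(schedule: List[List[bool]]) -> float:
    """Fraction of frame-steps actually computed (1.0 is the standard pipeline)."""
    total = sum(len(row) for row in schedule)
    active = sum(sum(row) for row in schedule)
    return active / total


def project_latent(last_latent: List[float], last_velocity: List[float],
                   dt: float) -> List[float]:
    """Estimate a skipped latent by first-order (Euler) extrapolation.

    Assumed stand-in for the latent trajectory projection module: it reuses
    the last computed latent and velocity of a non-keyframe to approximate
    its state at the step the keyframes are currently on.
    """
    return [x + dt * v for x, v in zip(last_latent, last_velocity)]


# Toy configuration: 8 frames, 10 steps, 3 assumed keyframes, stride of 3.
schedule = build_schedule(num_frames=8, num_steps=10, keyframes=[0, 4, 7], stride=3)
print(f"computed frame-steps: {dense_fraction(schedule):.2%}")
print(project_latent([1.0, 2.0], [-0.5, 0.25], dt=0.2))
```

In this toy configuration, 50 of the 80 frame-steps are computed (62.5%), since all 8 frames are updated on steps 0, 3, 6, and 9 and only the 3 keyframes on the remaining 6 steps. The saving in a real model would depend on how attention cost scales with the number of active frames, which this sketch does not model.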
Similar Articles
On-Policy Adversarial Flow Distillation for Autoregressive Video Generation
Proposes Adversarial Flow Distillation (AFD) for distilling heterogeneous black-box video generation models into autoregressive students, using on-policy feedback and forward-process flow-matching updates.
NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation
This paper introduces NaturalFlow, a fluency-aware optimization framework that reduces disruptive pauses in simultaneous speech-to-speech translation by leveraging model-internal signals, achieving a balance between low latency and natural speech flow.
Asymmetric Flow Models
Asymmetric Flow Modeling (AsymFlow) restricts noise prediction to low-rank subspaces for efficient high-dimensional flow-based generation, achieving state-of-the-art results on ImageNet and text-to-image tasks by fine-tuning from latent flow models.
AsymFlow Claims More Realistic AI Images by Moving Beyond Latent Diffusion
AsymFlow is a new method from Stanford that converts latent diffusion models to pixel space, achieving more realistic images by avoiding information loss from compression. It surpasses FLUX.2 klein on benchmarks with lower computational cost.
@__JohnNguyen__: Today we released the code for our CVPR 2026 paper, Flowception. Flowception bridges fully bidirectional sequence model…
Meta's FAIR team released the code for Flowception, a CVPR 2026 paper presenting a non-autoregressive video generation framework that interleaves frame insertion with continuous denoising to reduce error accumulation and computational cost.