Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

Hugging Face Daily Papers

Summary

MIGA is a train-free method for generating consistent long videos by reducing the training-inference gap and enhancing temporal consistency through dual consistency mechanisms.

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.
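The constant-memory property of frame-level autoregressive generation can be illustrated with a toy sketch of a FIFO-style queue. This is not MIGA's implementation, and it does not reproduce the two-stage alignment or dual consistency mechanisms, whose details are not given here. It assumes a fixed-length queue whose frames sit at increasing noise levels, a hypothetical `toy_denoise_step` standing in for the video model, and scalar "latents" in place of real frame tensors. It also records the noise span (highest minus lowest noise level) that the model would see in a single pass, which is the quantity the abstract describes as excessive.

```python
from collections import deque
import random


def toy_denoise_step(latent: float) -> float:
    """Hypothetical stand-in for one denoising step of a video model."""
    return 0.5 * latent


def fifo_generate(num_frames: int, queue_len: int = 4, seed: int = 0):
    """Toy frame-level autoregressive loop with a fixed-size queue.

    Each queue entry is (latent, noise_level). The head holds the least
    noisy frame and the tail holds the noisiest one. Initialising every
    slot with pure Gaussian noise is a simplification for illustration.
    """
    rng = random.Random(seed)
    queue = deque(
        (rng.gauss(0.0, 1.0), level) for level in range(1, queue_len + 1)
    )
    frames, spans, max_queue_size = [], [], 0

    while len(frames) < num_frames:
        # Noise span seen by the model in this pass (max level - min level).
        levels = [level for _, level in queue]
        spans.append(max(levels) - min(levels))

        # Denoise every frame in the queue by one step.
        queue = deque((toy_denoise_step(x), level - 1) for x, level in queue)

        # The head frame is now fully denoised: emit it.
        latent, level = queue.popleft()
        assert level == 0
        frames.append(latent)

        # Refill the tail with fresh noise at the highest noise level.
        queue.append((rng.gauss(0.0, 1.0), queue_len))
        max_queue_size = max(max_queue_size, len(queue))

    return frames, max_queue_size, spans


frames, max_queue_size, spans = fifo_generate(num_frames=10)
print(len(frames), max_queue_size, set(spans))
```

In this toy setup the queue never grows beyond `queue_len` no matter how many frames are emitted, while every pass mixes frames from the lowest to the highest noise level. A model trained on clips with a single shared noise level never sees such a mix, which is one way to read the training-inference mismatch mentioned above.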

Cached at: 05/21/26, 10:10 AM


Source: https://huggingface.co/papers/2605.18233



Similar Articles

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Hugging Face Daily Papers

The paper introduces MilliVid, a method for improving long-range consistency in video generation by using a multi-scale autoencoder to compress frames into hierarchical tokens and then generating them with a coarse-to-fine diffusion model, outperforming baselines on Minecraft videos.

Memento: Reconstruct to Remember for Consistent Long Video Generation

Hugging Face Daily Papers

Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms, achieving state-of-the-art performance in long-term subject consistency and cross-shot coherence.

Long Video Generation (4 minute read)

TLDR AI

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.