SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Hugging Face Daily Papers

Summary

SwiftI2V is an efficient framework for high-resolution image-to-video generation that uses conditional segment-wise generation to achieve 2K synthesis at significantly reduced computational cost. It enables practical generation on a single consumer or datacenter GPU while maintaining input fidelity.

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, this becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution model tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. In particular, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
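
To make the token-cost argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's accounting. It counts transformer tokens for a clip at an assumed 2K size versus an assumed low-resolution motion reference. The frame size (2560x1440), clip length (81 frames), compression factors, patch size, and the 5x downscale are all illustrative assumptions; the abstract does not state any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentGrid:
    """Assumed compression factors (illustrative; not taken from the paper)."""
    spatial_down: int = 8    # assumed VAE spatial downsampling factor
    temporal_down: int = 4   # assumed VAE temporal downsampling factor
    patch: int = 2           # assumed transformer patch size per spatial axis

    def tokens_per_latent_frame(self, width: int, height: int) -> int:
        # Each token covers (spatial_down * patch) pixels along each axis.
        stride = self.spatial_down * self.patch
        return (width // stride) * (height // stride)

    def latent_frames(self, frames: int) -> int:
        # Assumption: the first frame is kept, the rest are compressed in time.
        return 1 + (frames - 1) // self.temporal_down

    def tokens(self, width: int, height: int, frames: int) -> int:
        return self.tokens_per_latent_frame(width, height) * self.latent_frames(frames)


grid = LatentGrid()
full_2k = grid.tokens(2560, 1440, 81)  # assumed "2K" frame size and clip length
low_res = grid.tokens(512, 288, 81)    # assumed 5x-downscaled motion reference

token_ratio = full_2k / low_res
# Full self-attention scales with the square of the sequence length.
attention_ratio = token_ratio ** 2

print(f"2K tokens:        {full_2k}")
print(f"low-res tokens:   {low_res}")
print(f"token ratio:      {token_ratio:.0f}x")
print(f"attention ratio:  {attention_ratio:.0f}x")
```

Under these assumptions the low-resolution stage handles 25x fewer tokens, which is why generating the motion reference first is cheap relative to attending over the whole 2K clip at once. The actual savings depend on the model's real compression factors and resolutions.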

Cached at: 05/08/26, 07:10 AM

Source: https://huggingface.co/papers/2605.06356

Abstract

SwiftI2V is an efficient high-resolution image-to-video generation framework that uses conditional segment-wise generation and bidirectional contextual interaction to achieve scalable, input-faithful video synthesis with reduced computational requirements.

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, this becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution model tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. In particular, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
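
The idea of a bounded per-step token budget can be illustrated with a small scheduling sketch. This is not the CSG algorithm itself, whose conditioning, overlap, and ordering are not described in the abstract. It only shows how a clip might be split into consecutive segments so that no single step processes more tokens than a fixed budget. The function name `plan_segments` and all numeric values are hypothetical.

```python
from typing import List, Tuple


def plan_segments(
    num_frames: int, tokens_per_frame: int, token_budget: int
) -> List[Tuple[int, int]]:
    """Split frames into consecutive [start, end) segments under a token budget.

    Hypothetical helper for illustration; it ignores any conditioning tokens
    (input image, motion reference) that a real model would also attend to.
    """
    if num_frames < 0 or tokens_per_frame <= 0:
        raise ValueError("num_frames must be >= 0 and tokens_per_frame > 0")
    frames_per_segment = token_budget // tokens_per_frame
    if frames_per_segment < 1:
        raise ValueError("token_budget is too small to fit a single frame")
    return [
        (start, min(start + frames_per_segment, num_frames))
        for start in range(0, num_frames, frames_per_segment)
    ]


# Illustrative numbers only: 21 latent frames, 14,400 tokens per latent frame,
# and a budget of 50,000 tokens per generation step.
segments = plan_segments(num_frames=21, tokens_per_frame=14_400, token_budget=50_000)
for start, end in segments:
    print(f"frames [{start}, {end}) -> {(end - start) * 14_400} tokens")
```

With a fixed budget, the per-step cost stays constant as the clip grows and only the number of segments increases, which is the scaling property the abstract attributes to segment-wise generation.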

Similar Articles

SwiftVR: Real-Time One-Step Generative Video Restoration

Hugging Face Daily Papers

SwiftVR is a real-time one-step generative video restoration framework that achieves high frame rates on consumer GPUs using efficient attention mechanisms and a lightweight restoration-aware autoencoder.

State-of-the-art video and image generation with Veo 2 and Imagen 3

Google DeepMind Blog

Google announced Veo 2 and Imagen 3, state-of-the-art video and image generation models now available in VideoFX, ImageFX, and a new tool called Whisk. Veo 2 generates high-quality 4K videos with improved physics understanding and cinematography knowledge, while Imagen 3 produces brighter, better-composed images with diverse art styles.

Real-Time Long Video Generation (GitHub Repo)

TLDR AI

NVlabs releases LongLive 2.0, a parallel infrastructure for real-time long video generation using NVFP4 quantization, supporting both training and inference. It achieves 45.7 FPS and has been accepted at ICLR 2026.