SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

Summary

SwiftI2V is a new efficient framework for high-resolution image-to-video generation that uses conditional segment-wise generation to achieve 2K synthesis with significantly reduced computational costs. It enables practical generation on single consumer or datacenter GPUs while maintaining input fidelity.

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).

Original Article

View Cached Full Text

Cached at: 05/08/26, 07:10 AM

Paper page - SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Source: https://huggingface.co/papers/2605.06356

Abstract

SwiftI2V is an efficient high-resolution image-to-video generation framework that uses conditional segment-wise generation and bidirectional contextual interaction to achieve scalable, input-faithful video synthesis with reduced computational requirements.

High-resolutionimage-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored forhigh-resolutionI2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introducesConditional Segment-wise Generation(CSG) to synthesize videos segment-by-segment with a bounded per-steptoken budget, and adoptsbidirectional contextual interactionwithin each segment to improvecross-segment coherenceandinput fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).

View arXiv page View PDF Project page GitHub2 Add to collection

Get this paper in your agent:

hf papers read 2605\.06356

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.06356 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.06356 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.06356 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Paper page - SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

SwiftVR: Real-Time One-Step Generative Video Restoration

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

State-of-the-art video and image generation with Veo 2 and Imagen 3

Real-Time Long Video Generation (GitHub Repo)

@jiqizhixin: What if you could generate high-quality images in one step instead of hundreds? Stanford and ByteDance introduce W-Flow…

Submit Feedback

Similar Articles

SwiftVR: Real-Time One-Step Generative Video Restoration

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

State-of-the-art video and image generation with Veo 2 and Imagen 3

Real-Time Long Video Generation (GitHub Repo)

@jiqizhixin: What if you could generate high-quality images in one step instead of hundreds? Stanford and ByteDance introduce W-Flow…