SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
Summary
SwiftI2V is an efficient framework for high-resolution image-to-video generation. It uses conditional segment-wise generation to synthesize 2K video, with a reported 202x reduction in total GPU-time relative to end-to-end baselines. This makes generation practical on a single consumer or datacenter GPU while preserving fidelity to the input image.
Cached at: 05/08/26, 07:10 AM
Source: https://huggingface.co/papers/2605.06356
Abstract
SwiftI2V is an efficient high-resolution image-to-video generation framework that uses conditional segment-wise generation and bidirectional contextual interaction to achieve scalable, input-faithful video synthesis with reduced computational requirements.
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
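To illustrate why segment-by-segment synthesis bounds the per-step token budget, the following is a minimal sketch of the token arithmetic only, not the paper's implementation. Every number and name in it is an assumption for illustration: "2K" is taken as 2560x1440, the clip as 81 frames, the video VAE as compressing 8x spatially and 4x temporally with a 2x2 patch size, and the segment length as 17 frames. The helper names `latent_tokens`, `split_segments`, and `per_step_budget` are hypothetical, and the sketch ignores any extra conditioning tokens (input image, low-resolution motion reference) that a real model would add.

```python
def latent_tokens(frames, height, width, t_stride=4, s_stride=8, patch=2):
    """Approximate transformer token count for a clip.

    Assumes (hypothetically) a causal video VAE that keeps the first frame
    and compresses the rest by t_stride, plus s_stride spatial compression
    followed by patch x patch patchification.
    """
    latent_frames = 1 + (frames - 1) // t_stride
    latent_h = height // (s_stride * patch)
    latent_w = width // (s_stride * patch)
    return latent_frames * latent_h * latent_w


def split_segments(num_frames, segment_len):
    """Split frame indices [0, num_frames) into consecutive (start, end) segments."""
    return [
        (start, min(start + segment_len, num_frames))
        for start in range(0, num_frames, segment_len)
    ]


def per_step_budget(num_frames, height, width, segment_len):
    """Largest token count any single segment needs in one denoising step."""
    return max(
        latent_tokens(end - start, height, width)
        for start, end in split_segments(num_frames, segment_len)
    )


if __name__ == "__main__":
    full = latent_tokens(81, 1440, 2560)            # whole clip processed at once
    bounded = per_step_budget(81, 1440, 2560, 17)   # one segment at a time
    print(f"end-to-end tokens per step:   {full}")
    print(f"segment-wise tokens per step: {bounded}")
```

Under these assumed settings the per-step token count is set by the segment length rather than the clip length, so a longer video adds more segments instead of a larger attention problem. Because self-attention cost grows roughly quadratically with token count, shrinking the per-step budget reduces compute and memory by more than the token ratio alone suggests.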
Similar Articles
SwiftVR: Real-Time One-Step Generative Video Restoration
SwiftVR is a real-time one-step generative video restoration framework that achieves high frame rates on consumer GPUs using efficient attention mechanisms and a lightweight restoration-aware autoencoder.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
State-of-the-art video and image generation with Veo 2 and Imagen 3
Google announced Veo 2 and Imagen 3, state-of-the-art video and image generation models now available in VideoFX, ImageFX, and a new tool called Whisk. Veo 2 generates high-quality 4K videos with improved physics understanding and cinematography knowledge, while Imagen 3 produces brighter, better-composed images with diverse art styles.
Real-Time Long Video Generation (GitHub Repo)
NVlabs releases LongLive 2.0, a parallel infrastructure for real-time long video generation using NVFP4 quantization, supporting both training and inference. It achieves 45.7 FPS and is accepted at ICLR 2026.
@jiqizhixin: What if you could generate high-quality images in one step instead of hundreds? Stanford and ByteDance introduce W-Flow…
Stanford and ByteDance introduce W-Flow, a single-step generative model that uses Wasserstein gradient flows to achieve state-of-the-art one-step ImageNet 256x256 generation (1.29 FID) with 100x faster sampling than multi-step diffusion models.