LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
Summary
LongLive-2.0 introduces an NVFP4-based parallel infrastructure for long video generation, achieving up to 2.15x training speedup and 1.84x inference speedup with a 5B model reaching 45.7 FPS.
Source: https://huggingface.co/papers/2605.18739 Published on May 18
#1 Paper of the day
Abstract
LongLive-2.0 presents an NVFP4-based parallel infrastructure for long video generation that addresses training and inference bottlenecks through sequence-parallel autoregressive training and diffusion model tuning.
We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
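To make the NVFP4 idea in the abstract concrete, the following is a minimal pure-Python sketch of block-scaled 4-bit "fake quantization" (quantize, then dequantize). It is not the LongLive-2.0 implementation. It assumes the publicly described NVFP4 layout: E2M1 values whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6, with one shared scale per block of 16 values. As a simplification, the block scale is kept as an ordinary Python float rather than an FP8 (E4M3) value with a second per-tensor FP32 scale, and ties round toward the smaller magnitude. The names `quantize_block`, `dequantize_block`, and `fake_quantize` are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed NVFP4 block size: one shared scale per 16 values


def quantize_block(block):
    """Return (scale, codes) for one block; each code is (is_negative, grid_index)."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest FP4 magnitude (6.0).
    # Simplification: the scale stays a Python float instead of being cast to FP8.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Pick the nearest grid point (ties go to the smaller magnitude).
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
        codes.append((v < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a block scale and its 4-bit codes."""
    return [(-1.0 if neg else 1.0) * E2M1_GRID[idx] * scale for neg, idx in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list whose length is a multiple of BLOCK."""
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of BLOCK")
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[start:start + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    kv_slice = [0.1 * i for i in range(BLOCK)]  # stand-in for a slice of a KV cache
    approx = fake_quantize(kv_slice)
    print(max(abs(a - b) for a, b in zip(kv_slice, approx)))
```

Under the assumed layout, each value costs 4 bits plus an 8-bit scale shared by 16 values, roughly 4.5 bits per value, which is where the memory savings for weights, activations, and a quantized KV cache would come from. The rounding error per value is bounded by half the widest grid gap times the block scale, so blocks with a small dynamic range are reconstructed more accurately.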
Models citing this paper: 3
#### Efficient-Large-Model/LongLive-2.0-5B Text-to-Video • Updated about 4 hours ago • 6
#### Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4 Text-to-Video • Updated about 4 hours ago • 1
#### Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2 Text-to-Video • Updated about 4 hours ago
Datasets citing this paper: 1
#### Efficient-Large-Model/LongLive2-Toy-Dataset Updated about 4 hours ago
Similar Articles
Real-Time Long Video Generation (GitHub Repo)
NVlabs releases LongLive 2.0, a parallel infrastructure for real-time long video generation using NVFP4 quantization, supporting both training and inference. It achieves 45.7 FPS and is accepted at ICLR 2026.
@yukangchen_: We released a blog on "Why Video Gen Is an Infra Problem". https://research.nvidia.com/labs/eai/blogs/video-gen-is-an-i…
NVIDIA research blog argues that long video generation is becoming an infrastructure problem requiring full-stack co-design across models, memory, KV cache, VAE decoding, scheduling, and deployment, using LongLive 2.0 as a case study.
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
LongLive-RAG formulates long video generation as a retrieval-augmented generation problem, using a dynamic memory of previously generated latents to reduce error accumulation and identity drift, achieving improved quality across multiple autoregressive backbones.
@HowToAI_: NVIDIA has done the impossible and nobody's talking about it. They trained a 12 BILLION parameter LLM in 4-bit precisio…
NVIDIA trained a 12-billion parameter LLM in 4-bit precision using the new NVFP4 format with micro-scaling, achieving near-zero intelligence loss while halving memory usage and tripling arithmetic speed, marking a major breakthrough in efficient AI training.
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
LiteFrame proposes a lightweight video encoder with Compressed Token Distillation training that reduces latency and enables processing 8x more frames for long-form video understanding in Video LLMs, improving accuracy while reducing compute.