LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Hugging Face Daily Papers

Summary

LongLive-2.0 introduces an NVFP4-based parallel infrastructure for long video generation, achieving up to 2.15x training speedup and 1.84x inference speedup with a 5B model reaching 45.7 FPS.

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
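The teacher-forcing layout described above can be illustrated with a small chunk-level sketch. It is not the paper's implementation. It assumes a common teacher-forcing convention for AR diffusion training: the sequence is laid out as all clean chunks followed by all noisy chunks, each clean chunk attends causally to the clean history, and each noisy target chunk attends to the strictly earlier clean chunks plus itself. The function names `teacher_forcing_mask` and `paired_rank_layout`, and the even split of chunk pairs across ranks, are hypothetical and serve only to show why pairing clean and noisy chunks gives every rank the same amount of work.

```python
def teacher_forcing_mask(num_chunks):
    """Chunk-level attention mask for the assumed layout
    [clean_0 .. clean_{K-1}, noisy_0 .. noisy_{K-1}].

    mask[q][k] is True when query chunk q may attend to key chunk k.
    """
    k_chunks = num_chunks
    size = 2 * k_chunks
    mask = [[False] * size for _ in range(size)]
    for i in range(k_chunks):
        # Clean chunk i sees the clean history up to and including itself.
        for j in range(i + 1):
            mask[i][j] = True
        # Noisy target chunk i sees only strictly earlier clean chunks ...
        for j in range(i):
            mask[k_chunks + i][j] = True
        # ... and its own noisy tokens.
        mask[k_chunks + i][k_chunks + i] = True
    return mask


def paired_rank_layout(num_chunks, world_size):
    """Hypothetical assignment: each rank holds the same number of
    (clean, noisy) chunk pairs, so token counts per rank are equal."""
    if num_chunks % world_size != 0:
        raise ValueError("num_chunks must be divisible by world_size")
    per_rank = num_chunks // world_size
    layout = {}
    for rank in range(world_size):
        start = rank * per_rank
        layout[rank] = [
            (f"clean_{c}", f"noisy_{c}") for c in range(start, start + per_rank)
        ]
    return layout


if __name__ == "__main__":
    for row in teacher_forcing_mask(3):
        print("".join("x" if allowed else "." for allowed in row))
    print(paired_rank_layout(4, 2))
```

In a real system the mask would be expanded from chunks to tokens and fused into the attention kernel; the sketch only shows the visibility pattern and the balanced pairing idea.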

Cached at: 05/19/26, 06:31 AM

Source: https://huggingface.co/papers/2605.18739 Published on May 18

#1 Paper of the day

Abstract

LongLive-2.0 presents an NVFP4-based parallel infrastructure for long video generation that addresses training and inference bottlenecks through sequence-parallel autoregressive training and diffusion model tuning.

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
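To make the memory argument for NVFP4 concrete, the following is a minimal "fake quantization" sketch in plain Python. It is not the paper's kernel and does not touch any GPU API. It assumes the publicly described NVFP4 layout: 4-bit E2M1 values (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) that share one 8-bit scale per block of 16 elements. For simplicity the block scale is kept as an ordinary float rather than rounded to FP8, and the additional per-tensor scale is omitted; both simplifications are assumptions of this sketch, as are the helper names.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of elements that share one scale


def fake_quant_block(block):
    """Quantize one block to the E2M1 grid and dequantize it again."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    # Simplification: a real format would round this scale to FP8.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        target = abs(v) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))
        out.append(math.copysign(mag * scale, v))
    return out


def fake_quant(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK):
        out.extend(fake_quant_block(values[start:start + BLOCK]))
    return out


def bits_per_value(block=BLOCK, value_bits=4, scale_bits=8):
    """Average storage cost per element, including the shared block scale."""
    return value_bits + scale_bits / block


if __name__ == "__main__":
    print(bits_per_value())       # average bits per stored element
    print(16 / bits_per_value())  # compression ratio versus 16-bit storage
```

Under these assumptions each stored element costs 4 + 8/16 = 4.5 bits on average, so a KV cache kept in this format would be roughly 3.6x smaller than a 16-bit cache, which is also why quantized KV blocks would be cheaper to exchange between GPUs during SP inference.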

Get this paper in your agent:

hf papers read 2605.18739

Don’t have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 3

#### Efficient-Large-Model/LongLive-2.0-5B
Text-to-Video • Updated about 4 hours ago • 6

#### Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4
Text-to-Video • Updated about 4 hours ago • 1

#### Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2
Text-to-Video • Updated about 4 hours ago

Datasets citing this paper: 1

#### Efficient-Large-Model/LongLive2-Toy-Dataset
Updated about 4 hours ago

Similar Articles

Real-Time Long Video Generation (GitHub Repo)

TLDR AI

NVlabs releases LongLive 2.0, a parallel infrastructure for real-time long video generation using NVFP4 quantization, supporting both training and inference. It achieves 45.7 FPS and is accepted at ICLR 2026.