Real-Time Long Video Generation (GitHub Repo)
Summary
NVlabs releases LongLive 2.0, a parallel infrastructure for real-time long video generation that uses NVFP4 quantization and supports both training and inference. The fastest released model reaches 45.7 FPS; the original LongLive was accepted to ICLR 2026.
NVlabs/LongLive
Source: https://github.com/NVlabs/LongLive
🎬 LongLive 2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
💡 TLDR: Infra with NVFP4 and parallelism for both training and inference
News
- 🔥 [2026.05.13] We release LongLive 2.0, an infrastructure with NVFP4, parallelism, and multi-shot support for AR training, DMD distillation, and inference (⚡45.7 FPS). The original LongLive 1.0 is now in the v1.0 branch.
- 🔥 [2026.04.12] LongLive supports KV cache compression with TriAttention, with a 50% KV reduction and no quality drop.
- 🎉 [2026.1.27] LongLive is accepted to ICLR 2026.
- 🔥 [2026.1.11] LongLive supports adapting its original RoPE into KV-cache relative RoPE, which enables infinitely long video generation!
- 🔥 [2025.11.3] We implement LongLive on the linear-attention model SANA-Video! SANA-Video can now generate 60s interactive videos in real time.
- 🔥 [2025.9.29] We release the paper, this GitHub repo with all training and inference code, the LongLive-1.3B model weights, and the demo website.
Introduction
LongLive 1.0: Real-time Interactive Long Video Generation. You can find it in our v1.0 branch.
LongLive 2.0: an NVFP4 Parallel Infrastructure for Long Video Generation
- For training, it supports:
  - Balanced sequence parallel for AR training (teacher-forcing).
  - AR training on multi-shot (or single-shot) videos.
  - NVFP4 (or BF16) for both AR training and few-step distillation.
- For inference, it supports:
  - NVFP4 inference (W4A4) and NVFP4 KV cache.
  - Multi-shot attention sink.
  - Sequence parallel inference.
  - Async decoding.
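To make the NVFP4 items above concrete, here is a minimal pure-Python sketch of block-scaled 4-bit quantization. It assumes the publicly described NVFP4 layout (E2M1 elements whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, with one scale per 16-element block). For simplicity the scale is kept as a Python float rather than FP8, and the function names are hypothetical; none of this is taken from the LongLive codebase.

```python
# Illustrative simulation of NVFP4-style block quantization (not LongLive code).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 grid values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def fake_quantize(values):
    """Quantize then dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        scale, codes = quantize_block(block)
        out.extend(c * scale for c in codes)
    return out


weights = [0.01 * i - 0.15 for i in range(32)]
restored = fake_quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_err:.4f}")
```

Because each block carries its own scale, the rounding error in a block is bounded by that block's largest magnitude rather than by the largest value in the whole tensor.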
LongLive 1.0: Real-time Interactive Long Video Generation. It accepts sequential user prompts and generates corresponding videos in real time, enabling user-guided long video generation. The key insights are attention sink, KV-recache, and streaming long tuning.
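As a rough illustration of the attention-sink idea mentioned above, the sketch below keeps the first few cache entries permanently and evicts the oldest of the remaining entries once a window is full. This is the generic sink-plus-sliding-window pattern, not LongLive's actual cache. SinkKVCache and its parameters are hypothetical names, and plain integers stand in for per-frame key/value tensors.

```python
from collections import deque


class SinkKVCache:
    """Toy KV cache: permanent sink entries plus a sliding window of recent ones."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                      # never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops automatically

    def append(self, entry):
        # Fill the sink first; everything after that goes to the rolling window.
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def visible(self):
        """Entries the next attention step would attend to."""
        return self.sink + list(self.recent)


cache = SinkKVCache(num_sink=2, window=3)
for frame_id in range(8):
    cache.append(frame_id)
print(cache.visible())  # [0, 1, 5, 6, 7]
```

The cache size stays fixed at num_sink + window no matter how long generation runs, which is what makes this pattern attractive for long rollouts.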
Getting Started
Quick Start
BF16
import torch
from omegaconf import OmegaConf
from pipeline import CausalDiffusionInferencePipeline
from utils.config import normalize_config
from utils.inference_utils import (
    load_generator_checkpoint,
    place_vae_for_streaming,
    prepare_single_prompt_inputs,
    save_video,
)
prompt = "A compact silver robot walks through a clean robotics lab."
merged_checkpoint_path = "LongLive-2.0-5B/model_bf16.pt"
config = normalize_config(OmegaConf.load("configs/inference.yaml"))
device = torch.device("cuda")
torch.set_grad_enabled(False)
pipe = CausalDiffusionInferencePipeline(config, device=device)
load_generator_checkpoint(pipe.generator, merged_checkpoint_path)
pipe = pipe.to(device=device, dtype=torch.bfloat16)
place_vae_for_streaming(pipe, config) # honor streaming_vae + vae_device when set
pipe.generator.model.eval().requires_grad_(False)
noise, prompts = prepare_single_prompt_inputs(config, prompt, device)
video = pipe.inference(noise=noise, text_prompts=prompts)
save_video(video[0], "videos/quickstart/sample.mp4", fps=24)
place_vae_for_streaming is a no-op unless inference.streaming_vae is true and inference.vae_device is set, so toggling streaming-pipeline decode in your YAML is enough; the script does not need to change.
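The gating rule described above can be pictured as a small predicate. This is a hypothetical sketch of the condition only, using a plain dict in place of the OmegaConf config. streaming_vae_enabled is an illustrative name, "cuda:1" is an assumed device string, and the real place_vae_for_streaming may check more than this.

```python
def streaming_vae_enabled(config):
    """True only if inference.streaming_vae is true and inference.vae_device is set."""
    inference = config.get("inference", {})
    return bool(inference.get("streaming_vae")) and bool(inference.get("vae_device"))


# Both keys present: the VAE would be relocated.
print(streaming_vae_enabled({"inference": {"streaming_vae": True, "vae_device": "cuda:1"}}))
# vae_device missing: the call would be a no-op.
print(streaming_vae_enabled({"inference": {"streaming_vae": True}}))
```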
NVFP4
Point checkpoints.generator_ckpt in configs/nvfp4/inference_nvfp4.yaml at the downloaded checkpoint and set model_quant_use_transformer_engine according to the backend you are using:
- TransformerEngine checkpoint (model_te.pt): model_quant_use_transformer_engine: true
- FourOverSix checkpoint (model_4o6.pt): model_quant_use_transformer_engine: false
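If you switch between the two checkpoints often, the flag can be derived from the checkpoint filename. The helper below is a hypothetical convenience, not part of the repository. It only encodes the filename-to-flag mapping listed above and rejects anything else; the checkpoints/ directory in the usage lines is an assumed location.

```python
from pathlib import PurePosixPath

# Mapping taken from the list above: checkpoint filename -> flag value.
BACKEND_FLAGS = {"model_te.pt": True, "model_4o6.pt": False}


def te_flag_for(checkpoint_path):
    """Return the model_quant_use_transformer_engine value for a checkpoint path."""
    name = PurePosixPath(checkpoint_path).name
    if name not in BACKEND_FLAGS:
        raise ValueError(f"unrecognized NVFP4 checkpoint: {name}")
    return BACKEND_FLAGS[name]


print(te_flag_for("checkpoints/model_te.pt"))   # True
print(te_flag_for("checkpoints/model_4o6.pt"))  # False
```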
setup_nvfp4_pipeline handles checkpoint loading, NVFP4 module wrapping, weight materialization, dtype/device placement, and the streaming-pipeline VAE relocation for both backends. The BF16 pipe.to(...) shortcut is unsafe here because it would cast the quantized buffers.
import torch
from omegaconf import OmegaConf
from pipeline import CausalDiffusionInferencePipeline
from utils.config import normalize_config
from utils.inference_utils import prepare_single_prompt_inputs, save_video, setup_nvfp4_pipeline
prompt = "A compact silver robot walks through a clean robotics lab."
config = normalize_config(OmegaConf.load("configs/nvfp4/inference_nvfp4.yaml"))
device = torch.device("cuda")
torch.set_grad_enabled(False)
pipe = CausalDiffusionInferencePipeline(config, device=device)
setup_nvfp4_pipeline(pipe, config, device)
pipe.generator.model.eval().requires_grad_(False)
noise, prompts = prepare_single_prompt_inputs(config, prompt, device)
video = pipe.inference(noise=noise, text_prompts=prompts)
save_video(video[0], "videos/quickstart/sample_nvfp4.mp4", fps=24)
Models
| Model | FPS ↑ | Params | VBench ↑ | Multi-shot |
|---|---|---|---|---|
| LongLive-1.3B | 20.7 | 1.3B | 84.87 | |
| LongLive-2.0-5B | 24.8 | 5B | 85.06 | ✅ |
| LongLive-2.0-5B-NVFP4-4Step | 29.7 | 5B | 84.51 | ✅ |
| LongLive-2.0-5B-NVFP4-2Step | 45.7 | 5B | 83.14 | ✅ |
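The trade-off in the table is easier to read as ratios. The short script below copies the FPS and VBench columns and reports each model's speedup and VBench change relative to LongLive-2.0-5B, plus a real-time factor against the 24 fps used by save_video in the quick start. Choosing LongLive-2.0-5B as the baseline is our assumption, and the comparison presumes the FPS figures were measured under comparable conditions, which the table does not state.

```python
# (FPS, VBench) copied from the table above.
MODELS = {
    "LongLive-1.3B": (20.7, 84.87),
    "LongLive-2.0-5B": (24.8, 85.06),
    "LongLive-2.0-5B-NVFP4-4Step": (29.7, 84.51),
    "LongLive-2.0-5B-NVFP4-2Step": (45.7, 83.14),
}
BASELINE = "LongLive-2.0-5B"  # assumed reference point
PLAYBACK_FPS = 24  # fps passed to save_video in the quick start


def compare(name):
    """Return (speedup vs. baseline, VBench delta, real-time factor)."""
    fps, vbench = MODELS[name]
    base_fps, base_vbench = MODELS[BASELINE]
    return (round(fps / base_fps, 2),
            round(vbench - base_vbench, 2),
            round(fps / PLAYBACK_FPS, 2))


for name in MODELS:
    print(name, compare(name))
```

With these numbers, the 2-step NVFP4 model runs about 1.84x faster than LongLive-2.0-5B while giving up 1.92 VBench points.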
License
This repository is released under the Apache 2.0 license. See LICENSE for details.
Citation
Please consider citing our work if you find it useful:
@article{longlive_2.0,
  title={LongLive2.0: An NVFP4 Parallel Infrastructure for Long Video Generation},
  author={Chen, Yukang and Wang, Luozhou and Huang, Wei and Yang, Shuai and Zhang, Bohan and Xiao, Yicheng and Chu, Ruihang and Mao, Weian and Hu, Qixin and Liu, Shaoteng and Zhao, Yuyang and Mao, Huizi and Chen, Ying-Cong and Xie, Enze and Qi, Xiaojuan and Han, Song},
  journal={arXiv preprint arXiv},
  year={2026}
}
@inproceedings{longlive,
  title={Longlive: Real-time interactive long video generation},
  author={Yang, Shuai and Huang, Wei and Chu, Ruihang and Xiao, Yicheng and Zhao, Yuyang and Wang, Xianbang and Li, Muyang and Xie, Enze and Chen, Yingcong and Lu, Yao and others},
  booktitle={ICLR},
  year={2026},
}
Acknowledgement
- Self-Forcing: the AR training codebase and formulation we build upon.
- Wan2.2: the base video diffusion model components used in this release.
