Real-Time Long Video Generation (GitHub Repo)

TLDR AI 2026/05/20 00:00 Models

video-generation real-time parallel-infrastructure nvfp4 long-video open-source nvidia

Summary

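NVlabs has released LongLive 2.0, an NVFP4-quantized parallel infrastructure for real-time long video generation that supports both training and inference. It reaches 45.7 FPS. The original LongLive paper was accepted to ICLR 2026.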

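NVIDIA's LongLive 1.0 is a framework for interactive long video generation. It supports sequential prompts and real-time user-guided editing, using streaming attention and KV cache optimizations.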


Cached: 2026/05/21 06:36

NVlabs/LongLive source code: https://github.com/NVlabs/LongLive

# 🎬 LongLive 2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

[Paper](https://arxiv.org/abs/2605.18739) | [Code](https://github.com/NVlabs/LongLive) | [Video](https://www.youtube.com/watch?v=7oQALy32fiU) | [Model (LongLive-2.0-5B)](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B) | [Model (LongLive-2.0-5B-NVFP4-S4)](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4) | [Demo](https://nvlabs.github.io/LongLive/LongLive2/) | [Docs](https://nvlabs.github.io/LongLive/LongLive2/docs/)

## 💡 Overview: Training and Inference Infrastructure with NVFP4 and Parallelism

## News

- 🔥 [2026.05.13] We release LongLive 2.0, an infrastructure with NVFP4, parallelism, and multi-shot support for autoregressive training, DMD distillation, and inference (⚡45.7 FPS). The original LongLive 1.0 now lives on the [v1.0](https://github.com/NVlabs/LongLive/tree/v1.0) branch.
- 🔥 [2026.04.12] LongLive supports KV cache compression with [TriAttention](https://github.com/WeianMao/triattention): a 50% KV reduction with no quality loss. See it [here](https://github.com/WeianMao/triattention/tree/main/longlive).
- 🎉 [2026.1.27] LongLive is accepted to ICLR 2026.
- 🔥 [2026.1.11] LongLive supports adapting its original RoPE into a KV-cache relative RoPE, which allows generating videos of unlimited length!
- 🔥 [2025.11.3] We implemented LongLive on the linear attention model [SANA-Video](https://nvlabs.github.io/Sana/Video/)! SANA-Video can now generate 60-second interactive videos in real time.
- 🔥 [2025.9.29] We release the [paper](https://arxiv.org/abs/2509.22622), the GitHub repo [LongLive](https://github.com/NVlabs/LongLive) with all training and inference code, the model weights [LongLive-1.3B](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B), and the [demo page](https://nvlabs.github.io/LongLive).

## Introduction

**LongLive 1.0**: real-time interactive long video generation. It is available on our [v1.0 branch](https://github.com/NVlabs/LongLive/tree/v1.0).

**LongLive 2.0**: an NVFP4 parallel infrastructure for long video generation.

- Training support
  - [x] Balanced sequence parallelism for autoregressive (teacher-forcing) training.
  - [x] Autoregressive training on multi-shot (or single-shot) videos.
  - [x] NVFP4 (or BF16) in both autoregressive training and few-step distillation.
- Inference support
  - [x] NVFP4 inference (W4A4) and NVFP4 KV cache.
  - [x] Multi-shot attention sink.
  - [x] Sequence-parallel inference.
  - [x] Asynchronous decoding.

LongLive 1.0 accepts sequential user prompts and generates the corresponding video in real time, enabling user-guided long video generation. Its key ideas are attention sink, KV re-caching, and streaming long tuning.

## Getting Started

- [Full documentation](https://nvlabs.github.io/LongLive/LongLive2/docs/)
- [Installation](https://nvlabs.github.io/LongLive/LongLive2/docs/#installation)
- [NVFP4 setup](https://nvlabs.github.io/LongLive/LongLive2/docs/#nvfp4-installation)
- [Training](https://nvlabs.github.io/LongLive/LongLive2/docs/#training)
- [Inference](https://nvlabs.github.io/LongLive/LongLive2/docs/#inference)
- [Data organization](https://nvlabs.github.io/LongLive/LongLive2/docs/#training-data)

### Quick Start

#### BF16

```python
import torch
from omegaconf import OmegaConf

from pipeline import CausalDiffusionInferencePipeline
from utils.config import normalize_config
from utils.inference_utils import (
    load_generator_checkpoint,
    place_vae_for_streaming,
    prepare_single_prompt_inputs,
    save_video,
)

prompt = "A compact silver robot walks through a clean robotics lab."
merged_checkpoint_path = "LongLive-2.0-5B/model_bf16.pt"

config = normalize_config(OmegaConf.load("configs/inference.yaml"))
device = torch.device("cuda")
torch.set_grad_enabled(False)

pipe = CausalDiffusionInferencePipeline(config, device=device)
load_generator_checkpoint(pipe.generator, merged_checkpoint_path)
pipe = pipe.to(device=device, dtype=torch.bfloat16)
place_vae_for_streaming(pipe, config)  # takes effect only when streaming_vae + vae_device are set
pipe.generator.model.eval().requires_grad_(False)

noise, prompts = prepare_single_prompt_inputs(config, prompt, device)
video = pipe.inference(noise=noise, text_prompts=prompts)
save_video(video[0], "videos/quickstart/sample.mp4", fps=24)
```

`place_vae_for_streaming` is a no-op unless `inference.streaming_vae` is true and `inference.vae_device` is set. Streaming pipelined decoding can therefore be toggled in the yaml alone, with no change to the script.

#### NVFP4

In `configs/nvfp4/inference_nvfp4.yaml`, point `checkpoints.generator_ckpt` at the downloaded checkpoint and set `model_quant_use_transformer_engine` to match the backend in use:

- TransformerEngine checkpoint (`model_te.pt`): `model_quant_use_transformer_engine: true`
- FourOverSix checkpoint (`model_4o6.pt`): `model_quant_use_transformer_engine: false`

`setup_nvfp4_pipeline` handles checkpoint loading, NVFP4 module wrapping, weight materialization, dtype/device placement, and streaming-pipeline VAE relocation for both backends. The bf16 `pipe.to(...)` shortcut is not safe here because it would cast the quantized buffers.

```python
import torch
from omegaconf import OmegaConf

from pipeline import CausalDiffusionInferencePipeline
from utils.config import normalize_config
from utils.inference_utils import (
    prepare_single_prompt_inputs,
    save_video,
    setup_nvfp4_pipeline,
)

prompt = "A compact silver robot walks through a clean robotics lab."

config = normalize_config(OmegaConf.load("configs/nvfp4/inference_nvfp4.yaml"))
device = torch.device("cuda")
torch.set_grad_enabled(False)

pipe = CausalDiffusionInferencePipeline(config, device=device)
setup_nvfp4_pipeline(pipe, config, device)
pipe.generator.model.eval().requires_grad_(False)

noise, prompts = prepare_single_prompt_inputs(config, prompt, device)
video = pipe.inference(noise=noise, text_prompts=prompts)
save_video(video[0], "videos/quickstart/sample_nvfp4.mp4", fps=24)
```

## Models

| Model | FPS ↑ | Params | VBench ↑ | Multi-shot |
| --- | ---: | ---: | ---: | :---: |
| [LongLive-1.3B](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B) | 20.7 | 1.3B | 84.87 | No |
| [LongLive-2.0-5B](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B) | 24.8 | 5B | 85.06 | ✅ |
| [LongLive-2.0-5B-NVFP4-4Step](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4) | 29.7 | 5B | 84.51 | ✅ |
| [LongLive-2.0-5B-NVFP4-2Step](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2) | 45.7 | 5B | 83.14 | ✅ |

## License

This repository is released under the Apache 2.0 license. See LICENSE for details.

## Citation

If you find our work helpful, please consider citing:

```bibtex
@article{longlive_2.0,
  title={LongLive2.0: An NVFP4 Parallel Infrastructure for Long Video Generation},
  author={Chen, Yukang and Wang, Luozhou and Huang, Wei and Yang, Shuai and Zhang, Bohan and Xiao, Yicheng and Chu, Ruihang and Mao, Weian and Hu, Qixin and Liu, Shaoteng and Zhao, Yuyang and Mao, Huizi and Chen, Ying-Cong and Xie, Enze and Qi, Xiaojuan and Han, Song},
  journal={arXiv preprint arXiv},
  year={2026}
}
```

```bibtex
@inproceedings{longlive,
  title={Longlive: Real-time interactive long video generation},
  author={Yang, Shuai and Huang, Wei and Chu, Ruihang and Xiao, Yicheng and Zhao, Yuyang and Wang, Xianbang and Li, Muyang and Xie, Enze and Chen, Yingcong and Lu, Yao and others},
  booktitle={ICLR},
  year={2026},
}
```

## Acknowledgements

- [Self-Forcing](https://github.com/guandeh17/Self-Forcing): the autoregressive training codebase and formulation we build upon.
- [Wan2.2](https://github.com/Wan-Video/Wan2.2): the base video diffusion model components used in this release.
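
As a reading aid for the Models table, the sketch below computes each row's throughput ratio, VBench change, and per-frame latency relative to the BF16 LongLive-2.0-5B row. The FPS and VBench numbers are copied from the table. The `MODELS` dictionary and the `compare` helper are hypothetical names introduced here for illustration and are not part of the LongLive codebase. The hardware and resolution behind the FPS figures are not stated in this page, so the results are only arithmetic on the reported values.

```python
# FPS and VBench values copied from the Models table.
# MODELS and compare() are illustrative names, not LongLive APIs.
MODELS = {
    "LongLive-1.3B": {"fps": 20.7, "vbench": 84.87},
    "LongLive-2.0-5B": {"fps": 24.8, "vbench": 85.06},
    "LongLive-2.0-5B-NVFP4-4Step": {"fps": 29.7, "vbench": 84.51},
    "LongLive-2.0-5B-NVFP4-2Step": {"fps": 45.7, "vbench": 83.14},
}


def compare(name, baseline="LongLive-2.0-5B"):
    """Return (throughput ratio, VBench delta, ms per frame) against a baseline row."""
    model, base = MODELS[name], MODELS[baseline]
    speedup = model["fps"] / base["fps"]            # >1 means faster than baseline
    vbench_delta = model["vbench"] - base["vbench"]  # <0 means lower VBench score
    ms_per_frame = 1000.0 / model["fps"]             # average time budget per frame
    return speedup, vbench_delta, ms_per_frame


if __name__ == "__main__":
    for name in MODELS:
        speedup, delta, ms = compare(name)
        print(f"{name:30s} {speedup:5.2f}x  VBench {delta:+.2f}  {ms:5.1f} ms/frame")
```

Read this way, the 2-step NVFP4 checkpoint runs at about 1.84x the BF16 5B throughput for a 1.92-point lower VBench score, while the 4-step checkpoint gives up 0.55 points for about 1.20x.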

Related Articles

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Hugging Face Daily Papers

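LongLive-2.0 introduces an NVFP4-based parallel infrastructure for long video generation, achieving up to 2.15x speedup in training and 1.84x in inference, with the 5B model reaching 45.7 FPS.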

@yukangchen_: We published a blog post: "Why Video Gen Is an Infra Problem". https://research.nvidia.com/labs/eai/blogs/video-gen-is-an-i…

X AI KOLs Following

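An NVIDIA research blog post argues that long video generation is becoming an infrastructure problem that requires full-stack co-design across the model, memory, KV cache, VAE decoding, scheduling, and deployment infrastructure, using LongLive 2.0 as a case study.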

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Hugging Face Daily Papers

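LongLive-RAG formulates long video generation as a retrieval-augmented generation problem. It uses a dynamic memory of previously generated latents to reduce error accumulation and identity drift, and improves generation quality across multiple autoregressive backbones.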

Long Video Generation (4 minute read)

TLDR AI

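This paper introduces A²RD, a new architecture that uses agentic autoregressive diffusion to generate consistent long videos. It proposes a Retrieve–Synthesize–Refine–Update loop and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.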

NVlabs/Sana

GitHub Trending (daily)

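NVlabs/Sana is an efficiency-oriented open-source codebase for high-resolution image and video generation, with multiple model variants and training/inference pipelines.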