Training Video Foundation Models with NVIDIA NeMo
Summary
This paper presents a scalable open-source pipeline using NVIDIA NeMo for training and inference of Video Foundation Models, addressing challenges in generating high-quality videos with accelerated dataset curation and parallelized training.
Source: https://huggingface.co/papers/2503.12964
Published on Mar 17, 2025
Abstract
A scalable open-source pipeline using NVIDIA NeMo for training and inference of Video Foundation Models addresses challenges in generating high-quality videos.
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high-quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
Similar Articles
Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel
NVIDIA NeMo AutoModel leverages HuggingFace Transformers v5 to deliver 3.4-3.7x higher training throughput and 29-32% less GPU memory for fine-tuning Mixture-of-Experts models, with no code changes beyond a single import.
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
LongLive-2.0 introduces an NVFP4-based parallel infrastructure for long video generation, achieving up to 2.15x training speedup and 1.84x inference speedup with a 5B model reaching 45.7 FPS.
NVlabs/Sana
NVlabs/Sana is an efficiency-oriented open-source codebase for high-resolution image and video generation, including multiple model variants and training/inference pipelines.
@yukangchen_: We released a blog on "Why Video Gen Is an Infra Problem". https://research.nvidia.com/labs/eai/blogs/video-gen-is-an-i…
NVIDIA research blog argues that long video generation is becoming an infrastructure problem requiring full-stack co-design across models, memory, KV cache, VAE decoding, scheduling, and deployment, using LongLive 2.0 as a case study.
@HuggingPapers: NVIDIA just released AnyFlow on Hugging Face The first any-step video diffusion model that generates high-quality text-…
NVIDIA released AnyFlow, the first any-step video diffusion model for text-to-video generation, allowing smooth quality scaling across inference budgets (4 to 50 steps).