nvidia/Lyra-2.0


Summary

Lyra 2.0 is NVIDIA's framework for generating persistent, explorable 3D worlds from a single image, combining long-range video synthesis with explicit 3D reconstruction while addressing spatial forgetting and temporal drifting through novel training techniques.

Tags: arxiv:2604.13036, region:us

Cached at: 04/20/26, 02:44 PM


Source: https://huggingface.co/nvidia/Lyra-2.0

Lyra 2.0: Explorable Generative 3D Worlds

Paper, Project Page

Tianchang Shen*, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren*

* Equal Contribution

Description:

Lyra 2.0 is a framework for generating persistent, explorable 3D worlds at scale from a single image. The framework consists of two key components: first, it synthesizes a long-range video with strong global geometric consistency; second, it reconstructs the generated sequence into an explicit 3D representation. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing—retrieving relevant past frames and establishing dense correspondences with the target viewpoints—while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. This two-stage design enables scalable, spatially persistent scene generation while supporting real-time rendering. Lyra 2.0 establishes a new state of the art in single-image 3D scene generation.
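To make the "information routing" idea more concrete, here is a minimal sketch of retrieving relevant past frames by how much of their geometry is visible from a target viewpoint. This is an illustration only, not the authors' implementation: the function names, the visibility score, the pinhole camera with a single focal length, and the assumption that each frame's 3D points are already expressed in the target camera's coordinate frame are all hypothetical simplifications.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def visible_fraction(points_cam: Sequence[Point], focal: float,
                     width: int, height: int) -> float:
    """Fraction of points (given in the target camera frame) that project inside the image."""
    if not points_cam:
        return 0.0
    cx, cy = width / 2.0, height / 2.0  # principal point assumed at the image centre
    hits = 0
    for x, y, z in points_cam:
        if z <= 0.0:  # point lies behind the camera
            continue
        u = focal * x / z + cx
        v = focal * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            hits += 1
    return hits / len(points_cam)


def retrieve_frames(history: List[Sequence[Point]], focal: float,
                    width: int, height: int, k: int) -> List[int]:
    """Indices of the k history frames whose geometry is most visible from the target view."""
    scores = [visible_fraction(pts, focal, width, height) for pts in history]
    ranked = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Three toy history frames, each a tiny point set in target-camera coordinates.
history = [
    [(0.0, 0.0, 2.0), (0.1, 0.1, 3.0)],   # fully in view
    [(0.0, 0.0, -1.0), (5.0, 0.0, 1.0)],  # behind the camera or far off-axis
    [(0.0, 0.0, 4.0), (0.0, 0.0, -2.0)],  # half in view
]
top_frames = retrieve_frames(history, focal=500.0, width=832, height=480, k=2)
```

In this toy setup the geometry only decides which frames are retrieved; it contributes nothing to appearance, which mirrors the division of labor the description outlines.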

This model is ready for internal scientific research and development use.

License/Terms of Use

This model is released under the NVIDIA Internal Scientific Research and Development Model License.

Important Note: The Model and any Derivative Model may not be distributed, deployed, sublicensed, publicly displayed, or publicly performed. You may not use the Model or a Derivative Model in a production environment or for the purpose of generating works for sale or distribution. If you fail to comply with any of the terms in this Agreement, your rights under the NVIDIA Internal Scientific Research and Development Model License will automatically terminate.

Deployment Geography:

Global

Use Case:

This model is intended for researchers developing 3D world model techniques, and it allows them to generate a 3D scene from a single image.

Release Date:

GitHub 04/14/2026 via https://github.com/nv-tlabs/lyra/tree/main/Lyra-2

Reference(s):

Lyra 2.0: Explorable Generative 3D Worlds

Paper, Project Page

Model Architecture:

**Architecture Type:** Convolutional Neural Network (CNN), Transformer
**Network Architecture:** Transformer

This model was developed based on WAN-14B. Number of model parameters: 14B

Input:

**Input Type(s):** Camera Parameters, Image
**Input Format(s):** One-Dimensional (1D) Array of Camera Poses, Two-Dimensional (2D) Array of Images
**Input Parameters:** Camera Poses (1D), Images (2D)
**Other Properties Related to Input:** The input image should be 480 × 832 resolution, and we recommend using 81 frames for the camera parameters.
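As a rough illustration of inputs with the stated sizes, the sketch below builds 81 camera poses for a simple forward-moving camera and checks an image shape against the card's numbers. Several things here are assumptions rather than facts from the card: the 4×4 camera-to-world matrix layout, the reading of "480 × 832" as height × width, and the helper names `forward_trajectory` and `input_warnings`. The card does not specify the pose format, so consult the repository before relying on any of this.

```python
from typing import List, Sequence

Matrix = List[List[float]]

RECOMMENDED_FRAMES = 81  # recommended number of camera poses (from the card)
IMAGE_HEIGHT = 480       # assumption: first number in "480 x 832" is the height
IMAGE_WIDTH = 832


def forward_trajectory(num_frames: int = RECOMMENDED_FRAMES,
                       step: float = 0.05) -> List[Matrix]:
    """Hypothetical camera-to-world matrices for a camera sliding along +z without rotating."""
    poses = []
    for i in range(num_frames):
        poses.append([
            [1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, i * step],  # translation grows by `step` per frame
            [0.0, 0.0, 0.0, 1.0],
        ])
    return poses


def input_warnings(image_shape: Sequence[int], poses: Sequence[Matrix]) -> List[str]:
    """Return human-readable warnings where the inputs depart from the card's stated values."""
    warnings = []
    if tuple(image_shape[:2]) != (IMAGE_HEIGHT, IMAGE_WIDTH):
        warnings.append(
            f"image is {image_shape[0]}x{image_shape[1]}, "
            f"expected {IMAGE_HEIGHT}x{IMAGE_WIDTH}"
        )
    if len(poses) != RECOMMENDED_FRAMES:
        warnings.append(f"{len(poses)} poses given, {RECOMMENDED_FRAMES} recommended")
    return warnings


poses = forward_trajectory()
problems = input_warnings((480, 832, 3), poses)  # empty list when everything matches
```

The pose count produces a warning rather than an error because the card only recommends 81 frames.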

Output:

**Output Type(s):** Three-Dimensional (3D) Gaussian Scene
**Output Format:** Point cloud file (e.g., .ply)
**Output Parameters:** A set of 3D Gaussians, where each Gaussian is defined by a collection of attributes.
**Other Properties Related to Output:** The output is not a sequence of 2D images but a set of 3D primitives used to render a scene. For each of the M Gaussians, the key properties are:

  • Position (Mean): A 3D vector (x,y,z) defining the center of the Gaussian in 3D space.
  • Covariance (Shape & Orientation): This defines the ellipsoid’s shape and rotation. It’s typically stored as a 3D scale vector (s_x, s_y, s_z) and a 4D rotation quaternion (r_w, r_x, r_y, r_z).
  • Color: A 3-vector (R,G,B) representing the color of the Gaussian. This can also be represented by more complex Spherical Harmonics (SH) coefficients for view-dependent color effects.
  • Opacity: A scalar value (α) that controls the transparency of the Gaussian.
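The scale vector and rotation quaternion listed above determine the covariance through the standard 3D Gaussian Splatting factorization Σ = R S Sᵀ Rᵀ, where R is the rotation matrix of the quaternion and S = diag(s_x, s_y, s_z). The sketch below works through that formula in plain Python. The function names are illustrative, and it assumes the scales are already in linear units; files written by the original 3D Gaussian Splatting code store log-scales and pre-sigmoid opacities, and the card does not say which convention Lyra 2.0's output follows.

```python
from typing import List, Sequence

Matrix3 = List[List[float]]


def quaternion_to_rotation(q: Sequence[float]) -> Matrix3:
    """Rotation matrix for a quaternion ordered (r_w, r_x, r_y, r_z); normalized first."""
    w, x, y, z = q
    norm = (w * w + x * x + y * y + z * z) ** 0.5
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale: Sequence[float], q: Sequence[float]) -> Matrix3:
    """Covariance Sigma = R S S^T R^T with S = diag(scale)."""
    r = quaternion_to_rotation(q)
    # M = R S: multiply column j of R by scale[j].
    m = [[r[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M M^T, symmetric and positive semi-definite by construction.
    return [
        [sum(m[i][k] * m[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# Identity rotation: the covariance is diagonal with the squared scales (1, 4, 9).
sigma_identity = covariance((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0))

# 90-degree rotation about z swaps the x and y extents.
half = 0.5 ** 0.5
sigma_rotated = covariance((1.0, 2.0, 3.0), (half, 0.0, 0.0, half))
```

Storing scale and rotation separately, instead of six free covariance entries, keeps Σ a valid covariance matrix no matter what values an optimizer assigns to the parameters.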

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (H100 and GB200). By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times than CPU-only solutions.

Software Integration:

Runtime Engine(s):

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

  • V1.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

  • Open-domain video-text corpora (research use only)

Data Modality: Text, Video, Depth of Video

Video Training Data Size:

  • Less than 10,000 Hours

Data Collection Method by dataset:

  • Hybrid: Synthetic, Automated, Human

Labeling Method by dataset:

  • Synthetic, Automated, Human

Properties:

  • Modalities: 100k image-frame and text pairs with 3D annotations
  • Nature of the content: World exploration data
  • Linguistic characteristics: Natural Language

Testing Dataset:

  • Open-domain video-text corpora (research use only)

Data Modality: Text, Video, Depth of Video

Video Training Data Size:

  • Less than 10,000 Hours

Data Collection Method by dataset:

  • Hybrid: Synthetic, Automated, Human

Labeling Method by dataset:

  • Synthetic, Automated, Human

Properties:

  • Modalities: 1k image-frame and text pairs with 3D annotations
  • Nature of the content: World exploration data
  • Linguistic characteristics: Natural Language

Evaluation Dataset:

  • Open-domain video-text corpora (research use only)

Data Modality: Text, Video, Depth of Video

Video Training Data Size:

  • Less than 10,000 Hours

Data Collection Method by dataset:

  • Hybrid: Synthetic, Automated, Human

Labeling Method by dataset:

  • Synthetic, Automated, Human

Properties:

  • Modalities: 1k image-frame and text pairs with 3D annotations
  • Nature of the content: World exploration data
  • Linguistic characteristics: Natural Language

Inference:

Acceleration Engine: WAN-2.1

Test Hardware:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper

Computational Load:

The model was trained on 32 H100 nodes for 4,000 iterations. The estimated training token consumption is ~24 billion.
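For a sense of scale, the figures above imply a per-iteration token budget that is easy to work out. The short calculation below uses only the three numbers stated in the card; the even split across nodes is an assumption, and the card gives no per-node GPU count, so no per-GPU figure is derived.

```python
TOTAL_TOKENS = 24e9   # "~24 billion" estimated training tokens
ITERATIONS = 4000     # training iterations
NODES = 32            # H100 nodes


def tokens_per_iteration(total_tokens: float, iterations: int) -> float:
    """Average number of tokens consumed in one training iteration."""
    return total_tokens / iterations


per_iteration = tokens_per_iteration(TOTAL_TOKENS, ITERATIONS)  # 6,000,000 tokens
per_iteration_per_node = per_iteration / NODES                  # 187,500 tokens, assuming an even split
```

Since the total is itself an estimate, these are order-of-magnitude figures.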

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Plus Plus (++) Promise

We value you, the datasets, the diversity they represent, and what we have been entrusted with. This model and its associated data have been:

  • Verified to comply with current applicable disclosure laws, regulations, and industry standards.
  • Verified to comply with applicable privacy labeling requirements.
  • Annotated to describe the collector/source (NVIDIA or a third-party).
  • Characterized for technical limitations.
  • Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
  • Reviewed before release.
  • Tagged for known restrictions and potential safety implications.

Bias

| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | None |

Explainability

| Field | Response |
|---|---|
| Intended Task/Domain: | Novel view synthesis, video generation |
| Model Type: | Transformer |
| Intended Users: | Physical AI developers. |
| Output: | Three-Dimensional (3D) Gaussian Scene. |
| Describe how the model works: | We take a single image as input and synthesize a long-range video with global geometric consistency using a WAN-14B-based model. The generated video is then reconstructed into an explicit 3D Gaussian representation for real-time rendering. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable. |
| Technical Limitations & Mitigation: | The proposed method relies on synthetic data for training, which might limit the generalization ability if the target scenario is not in the pre-generated dataset. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Qualitative and Quantitative Evaluation including PSNR, SSIM, LPIPS metrics. |
| Potential Known Risks: | This model is trained on synthetic data, and may inaccurately reconstruct an out-of-distribution video that is not in the synthetic data domain. |
| Licensing: | NVIDIA Internal Scientific Research and Development Model License |
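Of the performance metrics named above, PSNR is simple enough to write out in full: it is 10·log10(MAX² / MSE), where MSE is the mean squared error between a reference image and a rendered one. The sketch below is a generic plain-Python version over flattened pixel values in [0, 1]. It is not the authors' evaluation code, and the function name and default `max_value` are illustrative. SSIM and LPIPS are more involved (LPIPS needs a pretrained network) and are not shown.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 0.1: MSE is about 0.01, so PSNR is about 20 dB.
score = psnr([0.0, 1.0], [0.1, 0.9])
```

Higher PSNR means the rendered view is closer to the reference.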

Privacy

| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | None Known |
| Is there provenance for all datasets used in training? | Yes |
| How often is dataset reviewed? | Before Release |
| Does data labeling (annotation, metadata) comply with privacy laws? | Not Applicable |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

Safety

| Field | Response |
|---|---|
| Model Application Field(s): | World Generation |
| Describe the life critical impact (if present). | Not Applicable |
| Use Case Restrictions: | Abide by NVIDIA Internal Scientific Research and Development Model License |
| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |

Citation

@article{shen2026lyra2,
  title={Lyra 2.0: Explorable Generative 3D Worlds},
  author={Shen, Tianchang and Bahmani, Sherwin and He, Kai and Srinivasan, Sangeetha Grama and Cao, Tianshi and Ren, Jiawei and Li, Ruilong and Wang, Zian and Sharp, Nicholas and Gojcic, Zan and Fidler, Sanja and Huang, Jiahui and Ling, Huan and Gao, Jun and Ren, Xuanchi},
  journal={arXiv preprint arXiv:2604.13036},
  year={2026}
}
