Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Summary
Pantheon360 introduces a 3D-aware 360° video diffusion framework that uses an explicit 3D cache to enforce geometric consistency, enabling high-fidelity digital twin generation from sparse 360° inputs.
Cached at: 05/26/26, 02:41 AM
Source: https://huggingface.co/papers/2605.25449
Abstract
Pantheon360 enables high-fidelity 360° video generation for digital twins by combining 3D-aware diffusion with explicit geometric caching to ensure spatial-temporal consistency.
Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency, constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.
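To make the idea of a geometric scaffold concrete, here is a minimal sketch of how a 3D cache could be built from one equirectangular frame and re-rendered from a new camera position. It is not the paper's implementation: the abstract does not say how the 3D Cache is represented, so the sketch assumes a coloured point cloud obtained from a per-pixel depth map, handles camera translation only, and uses hypothetical function names (`pixel_to_direction`, `direction_to_pixel`, `build_cache`, `render_cache`).

```python
import math

def pixel_to_direction(u, v, width, height):
    """Unit ray through the centre of equirectangular pixel (u, v)."""
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi   # longitude in (-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi        # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

def direction_to_pixel(x, y, z, width, height):
    """Inverse mapping: 3D offset -> (u, v, range) in pixel coordinates."""
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y / r)))
    u = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    v = (0.5 - lat / math.pi) * height - 0.5
    return u, v, r

def build_cache(depth, colors, width, height):
    """Assumed cache format: a list of (x, y, z, color) points."""
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = pixel_to_direction(u, v, width, height)
            d = depth[v][u]
            points.append((dx * d, dy * d, dz * d, colors[v][u]))
    return points

def render_cache(points, cam_pos, width, height):
    """Z-buffered point splat into a panorama seen from cam_pos.

    Pixels that no point lands on stay None (holes).
    """
    cx, cy, cz = cam_pos
    image = [[None] * width for _ in range(height)]
    zbuf = [[math.inf] * width for _ in range(height)]
    for x, y, z, color in points:
        px, py, pz = x - cx, y - cy, z - cz
        if px * px + py * py + pz * pz < 1e-18:
            continue  # point coincides with the camera centre
        u, v, r = direction_to_pixel(px, py, pz, width, height)
        ui = int(round(u)) % width                      # wrap horizontally
        vi = min(max(int(round(v)), 0), height - 1)     # clamp at the poles
        if r < zbuf[vi][ui]:                            # keep the nearest point
            zbuf[vi][ui] = r
            image[vi][ui] = color
    return image

# Usage: a tiny 8x4 panorama at constant depth, rendered from a shifted camera.
W, H = 8, 4
depth = [[2.0] * W for _ in range(H)]
colors = [[v * W + u for u in range(W)] for v in range(H)]
cache = build_cache(depth, colors, W, H)
novel_view = render_cache(cache, (0.0, 0.0, 0.5), W, H)
```

Rendering from the original position reproduces the input frame, while a shifted camera yields a geometrically consistent but coarse view that may contain holes. In a pipeline of the kind the abstract describes, that coarse render would be the scaffold that a generative stage refines.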
Get this paper in your agent:
hf papers read 2605.25449
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R repurposes video diffusion transformers for dense 3D tracking from monocular video, using dual-latent representation and temporal RoPE alignment to achieve state-of-the-art performance with 1.3x faster speed and 4.6x less peak memory than prior methods.
Helix4D: Complex 4D Mesh Generation
Helix4D introduces a framework for high-quality dynamic 4D mesh generation from video by extending Trellis2 with cross-frame attention and a 4D temporal encoding that repurposes redundant spatial RoPE bands without adding parameters.
Unified Panoramic Geometry Estimation via Multi-View Foundation Models
PaGeR adapts the multi-view perspective foundation model Depth Anything 3 to predict scale-invariant and metric depth, surface normals, and sky segmentation from a single equirectangular image, using a fixed cubemap representation that keeps VRAM and runtime constant. The paper also releases the ZüriPano and PanoInfinigen datasets.
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
SwanSphere proposes a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies, achieving superior performance in both video-to-spatial and text-to-spatial audio tasks.