CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Hugging Face Daily Papers 04/21/26, 12:00 AM Papers

Summary

CityRAG introduces a video generative model that produces long, physically grounded, 3D-consistent videos of real-world cities using geo-registered data, enabling realistic navigation and simulation for robotics and autonomous driving.

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

Original Article

Cached at: 04/22/26, 02:41 PM

Source: https://huggingface.co/papers/2604.19741

Abstract

CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography using geo-registered data as context.

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
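The abstract does not say how geo-registered data is selected as context, so the following is only a minimal sketch of one plausible retrieval step, not CityRAG's actual method. It assumes a small in-memory corpus of frames tagged with latitude and longitude and picks the nearest ones to a query position by great-circle distance. The names `GeoFrame`, `retrieve_context`, the 50 m radius, and the sample coordinates are all hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


@dataclass(frozen=True)
class GeoFrame:
    """Hypothetical record: one geo-registered frame in the corpus."""
    frame_id: str
    lat: float      # degrees
    lon: float      # degrees
    captured: str   # free-form capture label (date, weather), illustrative only


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_context(corpus, lat, lon, k=2, max_dist_m=50.0):
    """Return up to k frames within max_dist_m of the query, nearest first.

    Assumption: proximity alone decides relevance. A real system would
    likely also consider heading, occlusion, and capture conditions.
    """
    scored = [(haversine_m(lat, lon, f.lat, f.lon), f) for f in corpus]
    nearby = [(d, f) for d, f in scored if d <= max_dist_m]
    nearby.sort(key=lambda pair: pair[0])  # sort on distance only
    return [f for _, f in nearby[:k]]


# Toy corpus: two frames of the same spot captured at different times,
# plus one frame roughly a kilometre away.
corpus = [
    GeoFrame("a", 40.7580, -73.9855, "2019-06 sunny"),
    GeoFrame("b", 40.7581, -73.9856, "2021-01 snow"),
    GeoFrame("c", 40.7680, -73.9855, "2020-03 rain"),
]

context = retrieve_context(corpus, 40.7580, -73.9855)
print([f.frame_id for f in context])
```

In this toy corpus the query returns frames `a` and `b`, which show the same location under different conditions, while the distant frame `c` is excluded. Retrieved frames that differ in weather and capture date are the kind of temporally unaligned context the abstract describes, where the scene is shared but the transient attributes are not.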

Get this paper in your agent:

hf papers read 2604.19741

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

ABot-Earth 0.5: Generative 3D Earth Model

Video generation models as world simulators
