CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
Summary
CityRAG introduces a video generative model that produces long, physically grounded, 3D-consistent videos of real-world cities using geo-registered data, enabling realistic navigation and simulation for robotics and autonomous driving.
Cached at: 04/22/26, 02:41 PM
Source: https://huggingface.co/papers/2604.19741
Abstract
CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography using geo-registered data as context.
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
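The abstract does not say how geo-registered data is selected as context. As a purely illustrative sketch, and not the authors' method, one simple way to pick context for a query location is a nearest-neighbor lookup by great-circle distance. The names `GeoFrame`, `haversine_m`, and `retrieve_context` below are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass(frozen=True)
class GeoFrame:
    """Hypothetical record: one captured frame tagged with its GPS position."""
    frame_id: str
    lat: float  # degrees
    lon: float  # degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_context(corpus: list[GeoFrame], lat: float, lon: float, k: int = 2) -> list[GeoFrame]:
    """Return the k frames closest to the query location (brute-force scan)."""
    return sorted(corpus, key=lambda f: haversine_m(lat, lon, f.lat, f.lon))[:k]


# Assumed toy corpus: two frames on the same block, one roughly 111 km away.
corpus = [
    GeoFrame("a", 40.0000, -74.0),
    GeoFrame("b", 40.0010, -74.0),
    GeoFrame("c", 41.0000, -74.0),
]
nearest = retrieve_context(corpus, lat=40.0002, lon=-74.0, k=2)
print([f.frame_id for f in nearest])
```

A brute-force scan is enough for a toy corpus. A city-scale corpus would more plausibly use a spatial index, and a real system would likely also account for camera heading, which this sketch ignores.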
Similar Articles
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
LongLive-RAG formulates long video generation as a retrieval-augmented generation problem, using a dynamic memory of previously generated latents to reduce error accumulation and identity drift, achieving improved quality across multiple autoregressive backbones.
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
Sat3DGen introduces a geometry-first approach for generating street-level 3D scenes from a single satellite image, achieving improved geometric accuracy and photorealism through novel constraints and training strategies. The method demonstrates significant improvements over prior work on the VIGOR-OOD benchmark.
Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
This paper introduces V-RAGBench, a benchmark for evaluating retrieval-augmented generation over long egocentric videos, and CARVE, a method that adaptively selects retrieval configurations per chunk to improve VideoRAG performance.
ABot-Earth 0.5: Generative 3D Earth Model
ABot-Earth 0.5 is a generative 3D framework that synthesizes realistic 3D urban environments from satellite imagery using 3D Gaussian Splatting, enabling real-time visualization and closed-loop UAV navigation at low cost.
Video generation models as world simulators
OpenAI's technical report on Sora describes a video generation model that unifies diverse visual data through visual patches, enabling large-scale training of generative models capable of producing high-definition videos up to one minute long across variable durations, aspect ratios, and resolutions.