Latent Spatial Memory for Video World Models
Summary
This paper introduces latent spatial memory for video world models, storing 3D scene information directly in diffusion latent space to avoid costly pixel-space reconstruction. The proposed Mirage framework delivers up to 10.57x faster end-to-end generation and a 55x smaller memory footprint than explicit 3D baselines, while attaining state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
Cached at: 06/09/26, 08:43 AM
Source: https://huggingface.co/papers/2606.09828
Abstract
Latent spatial memory for video world models stores 3D scene information directly in diffusion latent space, eliminating pixel-space reconstruction overhead and achieving faster generation with reduced memory usage.
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57x faster end-to-end video generation and a 55x reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
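The two operations named in the abstract, depth-guided back-projection and latent-space warping, can be illustrated with a minimal NumPy sketch. This is not the Mirage implementation, which this page does not describe in detail. It assumes a pinhole camera whose intrinsics `K` are already scaled to the latent grid, one depth value per latent token, and a simple nearest-depth rule for resolving collisions. The function names `backproject_latents` and `warp_to_view` are hypothetical.

```python
import numpy as np

def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid into world space (illustrative only).

    Assumptions: depth is (H, W) at latent resolution, K is a (3, 3) pinhole
    intrinsics matrix for the latent grid, cam_to_world is a (4, 4) pose.
    Returns world points (H*W, 3) and their latent features (H*W, C).
    """
    H, W, C = latents.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous coordinates of token centers.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((H, W))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts_world, latents.reshape(-1, C)

def warp_to_view(points, feats, K, world_to_cam, H, W):
    """Project stored latent points into a target view (illustrative only).

    When several points land on the same token, the nearest one is kept.
    Returns the warped (H, W, C) latent grid and an (H, W) validity mask;
    tokens with mask == False are holes the generator would have to fill.
    """
    C = feats.shape[1]
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    front = pts_cam[:, 2] > 1e-6               # drop points behind the camera
    pts_cam, f = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    flat, z, f = (v * W + u)[inside], pts_cam[inside, 2], f[inside]
    # Sort by target token, then by depth, and keep the first hit per token.
    order = np.lexsort((z, flat))
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]
    out = np.zeros((H * W, C), dtype=feats.dtype)
    mask = np.zeros(H * W, dtype=bool)
    out[flat[keep]] = f[keep]
    mask[flat[keep]] = True
    return out.reshape(H, W, C), mask.reshape(H, W)

# Usage with made-up shapes and camera parameters.
rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 6, 8))
depth = np.full((4, 6), 2.0)
K = np.array([[5.0, 0.0, 3.0], [0.0, 5.0, 2.0], [0.0, 0.0, 1.0]])
points, feats = backproject_latents(latents, depth, K, np.eye(4))
warped, mask = warp_to_view(points, feats, K, np.eye(4), 4, 6)
```

Because the cache holds latent features rather than RGB values, querying a new view in this sketch needs no rendering pass followed by VAE encoding, which is the overhead the abstract attributes to pixel-space point cloud memory.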
Similar Articles
@HuggingPapers: Microsoft Research introduces Mirage Latent spatial memory stores 3D scenes directly as latent tokens, skipping the cos…
Microsoft Research introduces Mirage, a latent spatial memory that stores 3D scenes as latent tokens, achieving up to 10.57x faster video generation and 55x lower memory use with state-of-the-art consistency.
Composition of Memory Experts for Diffusion World Models
A new diffusion-based world model framework that uses a composition of specialized memory experts (short-term, long-term episodic, and spatial) to achieve better temporal consistency and long-context modeling without quadratic cost.
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
This paper introduces MBench, a benchmark for evaluating the memory capabilities of video world models across entity, environment, and causal consistency over long temporal horizons.
@HaochengXiUCB: New blog post: The Forgetting Wall in Video and World Models Long-horizon video generation is not just limited by compu…
This blog post introduces the concept of the 'Forgetting Wall' in long-horizon video generation and world models, arguing that the primary bottleneck is memory (KV cache growth) rather than compute, and explores compression as a key direction for future models.
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA replaces per-head KV caches in video diffusion models with a shared low-rank latent and decoupled 3D-RoPE positional keys, reducing per-token KV memory by 92.7% and improving throughput by 1.23x on a B200 while maintaining quality on VBench benchmarks.