Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Summary
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
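The two-stage control flow described above can be sketched in a few lines. This is a toy illustration only: the names rollout, linear_extrapolation, and identity_renderer are hypothetical, and the two placeholder functions stand in for the frozen-foundation-model feature predictor and the latent diffusion renderer, neither of which is reproduced here.

```python
from typing import Callable, List

Features = List[float]  # toy stand-in for a foundation-model feature map
Frame = List[float]     # toy stand-in for an RGB frame


def rollout(
    context: List[Features],
    predict_next: Callable[[List[Features]], Features],
    render: Callable[[Features], Frame],
    steps: int,
) -> List[Frame]:
    """Forecast features autoregressively, then render a frame from each forecast."""
    history = list(context)
    frames = []
    for _ in range(steps):
        nxt = predict_next(history)  # stage 1: semantic representation prediction
        history.append(nxt)          # the prediction is fed back as new context
        frames.append(render(nxt))   # stage 2: representation-guided synthesis
    return frames


# Hypothetical placeholders, NOT the paper's models.
def linear_extrapolation(history: List[Features]) -> Features:
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def identity_renderer(features: Features) -> Frame:
    return list(features)


frames = rollout(
    [[0.0, 1.0], [1.0, 1.0]], linear_extrapolation, identity_renderer, steps=3
)
print(frames)
```

Because each predicted representation is appended to the history, errors in stage 1 can compound over the rollout, which is why the renderer has to tolerate imperfect conditioning.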
Cached at: 04/20/26, 08:28 AM
Source: https://huggingface.co/papers/2604.11707
Abstract
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations and then using them to guide photorealistic visual synthesis, addressing train-test mismatches through specialized conditioning strategies.
Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix
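The abstract names the two conditioning strategies but does not specify them, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes nested dropout in its general sense (keep a random-length prefix of ordered feature channels and zero the rest) and reads mixed supervision as randomly swapping ground-truth conditioning for predicted conditioning during training. The function names, the uniform sampling of the prefix length, and the p_predicted parameter are all hypothetical.

```python
import random
from typing import List, Optional


def nested_dropout(
    features: List[float], rng: random.Random, keep: Optional[int] = None
) -> List[float]:
    """Keep a prefix of the channels and zero the remainder."""
    if keep is None:
        # Assumption: prefix length sampled uniformly; other distributions are possible.
        keep = rng.randint(1, len(features))
    return [x if i < keep else 0.0 for i, x in enumerate(features)]


def choose_conditioning(
    ground_truth: List[float],
    predicted: List[float],
    p_predicted: float,
    rng: random.Random,
) -> List[float]:
    """Assumed form of mixing: use predicted features with probability p_predicted."""
    return predicted if rng.random() < p_predicted else ground_truth


rng = random.Random(0)
gt = [1.0, 2.0, 3.0, 4.0]
pred = [0.9, 2.2, 2.7, 4.1]

cond = choose_conditioning(gt, pred, p_predicted=0.5, rng=rng)
print(nested_dropout(cond, rng))
```

Both steps expose the renderer to degraded or imperfect conditioning at training time, which is the stated goal of reducing the gap between ground-truth representations in training and predicted ones at inference.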
arXiv page: https://arxiv.org/abs/2604.11707
PDF: https://arxiv.org/pdf/2604.11707
GitHub: https://github.com/Sta8is/Re2Pix
Community
Paper submitter
3 days ago (https://huggingface.co/papers/2604.11707#69e1f27e67ed2fdf660b12c4)
Pixel or latent world models?
Video world models fall into two camps:
• generate photorealistic frames
• predict semantic features of the future (e.g., DINOv2)
Why choose one?
We introduce Re2Pix, a hierarchical approach that combines both. combined_video_60 (https://cdn-uploads.huggingface.co/production/uploads/677272184d148b904333e874/7oF3pYvDvaEVo5UVIUgMs.gif) combined_video_228 (https://cdn-uploads.huggingface.co/production/uploads/677272184d148b904333e874/itFW8_VbKd4J9yP2_v34i.gif)
Similar Articles
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Introduces Future-L1, an interleaved latent visual reasoning framework that improves video event prediction by maintaining visual semantics in latent space. Achieves state-of-the-art results on FutureBench and TwiFF-Bench benchmarks.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation
The paper introduces MilliVid, a method for improving long-range consistency in video generation by using a multi-scale autoencoder to compress frames into hierarchical tokens and then generating them with a coarse-to-fine diffusion model, outperforming baselines on Minecraft videos.
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video, achieving clean power-law scaling and strong zero-shot performance.
Memento: Reconstruct to Remember for Consistent Long Video Generation
Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms, achieving state-of-the-art performance in long-term subject consistency and cross-shot coherence.