RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Hugging Face Daily Papers Papers

Summary

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video, achieving clean power-law scaling and strong zero-shot performance.

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
Original Article
View Cached Full Text

Cached at: 06/01/26, 07:21 PM

Paper page - RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Source: https://huggingface.co/papers/2605.31535

Abstract

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.

Self-supervisednovel view synthesis(NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified,feed-forward transformerthat consolidatescamera estimation,scene reconstruction, andrenderinginto a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimaldynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits cleanpower-law scalingwith data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strongzero-shot open-set performancecompetitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder

View arXiv pageView PDFProject pageGitHub1Add to collection

Get this paper in your agent:

hf papers read 2605\.31535

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper1

#### CompVis/rayder Image-to-Image• Updated16 minutes ago • 2

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.31535 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.31535 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Hugging Face Daily Papers

Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.

Long Video Generation (4 minute read)

TLDR AI

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.