RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

self-supervised novel-view-synthesis transformer video camera-estimation scene-reconstruction rendering

Summary

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video, achieving clean power-law scaling and strong zero-shot performance.

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
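The abstract reports power-law scaling with data and compute, but this page gives no exponents or measurements. As a rough illustration of what such a claim means in practice, the sketch below fits a power law y = a * x^b by least squares in log-log space, where a power law appears as a straight line. The function name `fit_power_law` and every number in it are hypothetical, synthetic values chosen for the example; none of them are results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the log-log regression line is the power-law exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept of the regression line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic points generated from loss = 2.0 * compute**-0.25.
# These are made-up illustration values, NOT numbers from the RayDer paper.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [2.0 * c ** -0.25 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"fitted prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

With real measurements, the points would scatter around the fitted line, and how tightly they follow it is one way to judge how "clean" the scaling is.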


Cached at: 06/01/26, 07:21 PM


Source: https://huggingface.co/papers/2605.31535

Abstract

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.



Get this paper in your agent:

hf papers read 2605.31535

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

CompVis/rayder (Image-to-Image)



Similar Articles

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Long Video Generation (4 minute read)

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
