RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
Summary
RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video, achieving clean power-law scaling and strong zero-shot performance.
Cached at: 06/01/26, 07:21 PM
Source: https://huggingface.co/papers/2605.31535
Abstract
RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.
Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
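As a rough illustration of what "clean power-law scaling" means in practice, the sketch below fits a curve of the form loss = a * size^(-b) by least squares in log-log space, where a power law appears as a straight line. This is not the RayDer authors' analysis: the functional form, the helper name `fit_power_law`, and all data points are assumptions made up for illustration, and the abstract does not report the fitted exponents.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y).

    Hypothetical helper for illustration; returns the tuple (a, b).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                    # slope of the log-log line equals -b
    intercept = mean_y - slope * mean_x  # intercept equals log(a)
    return math.exp(intercept), -slope

# Synthetic points generated from a known power law (NOT results from the paper).
data_sizes = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * d ** -0.25 for d in data_sizes]

a, b = fit_power_law(data_sizes, losses)
print(f"a={a:.3f}, b={b:.3f}")
```

With real measurements, scatter around the fitted line in log-log space gives a quick visual check of how "clean" the scaling is.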
Get this paper in your agent:
hf papers read 2605.31535
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
CompVis/rayder (Image-to-Image)
Similar Articles
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
Long Video Generation (4 minute read)
The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam is a research paper introducing a diffusion-based framework for unified novel view synthesis that dynamically coordinates geometric and appearance priors to improve robustness against geometric errors.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.