Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
Summary
Realiz3D introduces domain-aware learning to decouple visual domain from control signals in 3D-consistent image generation, using residual adapters and layer-specific denoising to produce photorealistic outputs from synthetic renders.
Cached at: 05/15/26, 08:24 AM
Source: https://huggingface.co/papers/2605.13852
Abstract
Realiz3D addresses the domain gap between synthetic renders and real images in 3D-consistent image generation by decoupling visual domain from control signals through residual adapters and layer-specific denoising strategies.
We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models, that decouples controls and visual domain. The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability, without fitting to specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.
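To make the idea of a domain co-variate fed into a small residual adapter more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the bottleneck layout, the additive per-domain embedding, the zero-initialised up-projection, and every name in it (`DomainResidualAdapter`, `DOMAINS`, `matvec`) are assumptions chosen for illustration only.

```python
import random

# Hypothetical domain labels; the abstract only says "real or synthetic".
DOMAINS = {"real": 0, "synthetic": 1}


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class DomainResidualAdapter:
    """Illustrative bottleneck adapter (an assumption, not the paper's design):

        out = h + W_up @ relu(W_down @ h + e[domain])

    The domain co-variate selects an embedding e[domain] that shifts the
    residual, while the backbone feature h passes through unchanged.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Zero-initialised up-projection: the adapter starts as the identity,
        # so the pre-trained generator is undisturbed before training.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        # One learnable embedding per visual domain.
        self.domain_emb = [[rng.gauss(0.0, 0.1) for _ in range(bottleneck)]
                           for _ in DOMAINS]

    def __call__(self, h, domain):
        emb = self.domain_emb[DOMAINS[domain]]
        hidden = [max(0.0, a + b) for a, b in zip(matvec(self.w_down, h), emb)]
        delta = matvec(self.w_up, hidden)
        return [hi + di for hi, di in zip(h, delta)]


adapter = DomainResidualAdapter(dim=4, bottleneck=2)
features = [1.0, -0.5, 0.25, 2.0]
print(adapter(features, "real"))       # same as `features` at initialisation
print(adapter(features, "synthetic"))  # same as `features` at initialisation
```

Under these assumptions, training on synthetic renders would use `domain="synthetic"`, and switching the co-variate to `"real"` at inference time would request the photographic look while the other control signals stay in place.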
Similar Articles
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D introduces a pixel-aligned 3D generation approach that improves fidelity by establishing direct pixel-to-3D correspondences through back-projection conditioning, addressing issues in canonical space generation.
GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
GenRecon introduces a method for 3D scene reconstruction that integrates generative 3D priors with multi-view image conditioning, achieving high-fidelity, editable mesh reconstructions of indoor environments and outperforming existing methods by 16%.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video, achieving clean power-law scaling and strong zero-shot performance.
JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising
JanusMesh is a fast, training-free framework that generates text-driven 3D visual illusions—a single mesh revealing different semantics from different viewing angles—by decoupling generation into cross-space dual-branch denoising and view-conditioned texture synthesis, achieving high realism in just 3-5 minutes.