GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
Summary
GenRecon introduces a method for 3D scene reconstruction that integrates generative 3D priors with multi-view image conditioning, achieving high-fidelity, editable mesh reconstructions of indoor environments and outperforming existing methods by 16%.
Cached at: 05/25/26, 02:35 AM
Source: https://huggingface.co/papers/2605.23888
Abstract
A novel method for 3D scene reconstruction that integrates generative 3D priors with multi-view image conditioning to produce high-fidelity, editable mesh reconstructions of indoor environments.
We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.
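The abstract gives no implementation details, so the following is only a toy sketch of two ideas it names: tiling an extent with overlapping chunks, and lifting posed image features to a 3D point by projection in a way that does not depend on view order. Everything here is an assumption made for illustration: the names `chunk_origins`, `project`, and `lift_features`, the pinhole camera dictionary, nearest-pixel lookup, and mean pooling as the aggregation. It is not the authors' method and does not use Trellis.2.

```python
def chunk_origins(extent, chunk, overlap):
    """Start coordinates of overlapping 1D chunks that cover [0, extent].

    Assumption: a fixed chunk size and a fixed overlap along one axis.
    """
    stride = chunk - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    origins = [0.0]
    while origins[-1] + chunk < extent:
        origins.append(origins[-1] + stride)
    return origins


def project(point, cam):
    """Pinhole projection of a world point into pixel coordinates.

    `cam` is a hypothetical dict with a world-to-camera rotation R (3 rows),
    a translation t, a focal length f, and a principal point (cx, cy).
    Returns None for points at or behind the camera plane.
    """
    x, y, z = (
        sum(r * p for r, p in zip(row, point)) + ti
        for row, ti in zip(cam["R"], cam["t"])
    )
    if z <= 1e-6:
        return None
    return cam["f"] * x / z + cam["cx"], cam["f"] * y / z + cam["cy"]


def lift_features(point, views):
    """Mean-pool the features of every view that sees `point`.

    `views` is a list of (cam, feature_map) pairs, where feature_map[v][u]
    is a list of channel values. Averaging is symmetric in its inputs, so
    the result does not depend on the order of the views (up to
    floating-point rounding).
    """
    acc, count = None, 0
    for cam, feat in views:
        uv = project(point, cam)
        if uv is None:
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))  # nearest pixel
        if 0 <= v < len(feat) and 0 <= u < len(feat[0]):
            sample = feat[v][u]
            acc = list(sample) if acc is None else [a + b for a, b in zip(acc, sample)]
            count += 1
    return None if count == 0 else [a / count for a in acc]
```

In this sketch, each chunk origin from `chunk_origins` would define a local grid of 3D points, and `lift_features` would be evaluated at each grid point to build a conditioning volume expressed in scene coordinates. A learned aggregation or bilinear sampling could replace the mean and nearest-pixel lookup; those choices are placeholders here.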
Similar Articles
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction
Introduces GARD, a diffusion-based framework that operates in the feature space of a feed-forward 3D reconstructor to jointly recover scene geometry and high-quality imagery from degraded inputs.
Unified Panoramic Geometry Estimation via Multi-View Foundation Models
PaGeR adapts the multi-view perspective foundation model Depth Anything 3 to predict scale-invariant and metric depth, surface normals, and sky segmentation from a single equirectangular image, using a fixed cubemap representation that keeps VRAM and runtime constant. The paper also releases the ZüriPano and PanoInfinigen datasets.
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
Sat3DGen introduces a geometry-first approach for generating street-level 3D scenes from a single satellite image, achieving improved geometric accuracy and photorealism through novel constraints and training strategies. The method demonstrates significant improvements over prior work on the VIGOR-OOD benchmark.
VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors
VidSplat is a training-free generative reconstruction framework that uses video diffusion priors to recover complete 3D scenes from sparse inputs by synthesizing novel views.