SurGe: Improved Surface Geometry in Point Maps
Summary
SurGe introduces a Neighborhood Attention Decoder and a reformulated scale-invariant gradient matching loss to improve local surface geometry accuracy in feedforward 3D reconstruction, particularly for thin structures. It achieves state-of-the-art average rank on zero-shot monocular geometry benchmarks, with better local point map and normal metrics.
Cached at: 06/01/26, 11:22 PM
Source: https://huggingface.co/papers/2605.31577
We improve local accuracy in feedforward 3D reconstruction. Current point map models produce bending and oscillating artifacts on thin structures (chair legs, street lamps, etc.). These artifacts are easy to spot visually but are not well captured by pointwise metrics like AbsRel.
We use a Neighborhood Attention Decoder (NAD). Like DPT-style heads, it decodes point maps progressively across scales, but it replaces conv-based local mixing with neighborhood attention and window-matched RoPE in ViT-like blocks.
This gives content-dependent local mixing without full self-attention at pixel resolution. In practice, it helps with thin structures and discontinuities, while also avoiding the patch artifacts we see with plain ViT/MLP decoders.
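To make the idea of content-dependent local mixing concrete, here is a minimal single-head sketch of neighborhood attention on a small feature grid. It is an illustration only, not the NAD itself: the function name `neighborhood_attention`, the nested-list layout, and the border handling (the window is simply truncated at the image edge) are assumptions, and it leaves out the learned projections, multiple heads, window-matched RoPE, and the multi-scale decoding described above.

```python
import math


def neighborhood_attention(q, k, v, window=3):
    """Single-head neighborhood attention over an H x W grid (illustrative).

    q, k, v: nested lists of shape [H][W][C] holding per-pixel features.
    Each query pixel attends only to keys inside a window x window
    neighborhood centred on it. The window is truncated at the border,
    which is a simplification chosen for this sketch.
    """
    h, w = len(q), len(q[0])
    c = len(q[0][0])
    r = window // 2
    scale = 1.0 / math.sqrt(c)
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # In-bounds neighbours of pixel (i, j).
            nbrs = [(a, b)
                    for a in range(max(0, i - r), min(h, i + r + 1))
                    for b in range(max(0, j - r), min(w, j + r + 1))]
            # Scaled dot-product logits against the local keys only.
            logits = [scale * sum(x * y for x, y in zip(q[i][j], k[a][b]))
                      for a, b in nbrs]
            # Numerically stable softmax over the neighbourhood.
            m = max(logits)
            weights = [math.exp(lg - m) for lg in logits]
            z = sum(weights)
            # Content-dependent weighted average of the local values.
            out[i][j] = [
                sum(wt * v[a][b][ch] for wt, (a, b) in zip(weights, nbrs)) / z
                for ch in range(c)
            ]
    return out
```

The cost per pixel depends on the window size rather than on the image size, which is why this kind of mixing stays affordable at pixel resolution. Unlike a convolution, whose weights are fixed after training, the mixing weights here are recomputed from the features at every pixel.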
We also reformulate scale-invariant gradient matching for point maps. This family of losses worked best for us when the main global error is relative. Our version keeps the pairwise scale-invariant behavior, but is directly applicable to points instead of scalar depth only.
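The exact formulation is not given in this summary, so the following is only a sketch of what a pairwise scale-invariant gradient term on 3D points could look like. It assumes one plausible extension of a known scalar-depth form, where the difference between two pixels is divided by the sum of their magnitudes: here the difference of two 3D points is divided by the sum of their norms. The names `normalized_difference` and `si_gradient_loss`, the L1 penalty, and the pixel offsets in `steps` are all hypothetical choices, not the SurGe loss.

```python
import math


def _norm(p):
    return math.sqrt(sum(x * x for x in p))


def normalized_difference(p, q):
    """Difference of two 3D points divided by the sum of their norms.

    Scaling both points by the same positive factor leaves the result
    unchanged. This normalization is an assumption made for this sketch.
    """
    denom = _norm(p) + _norm(q)
    if denom == 0.0:
        return [0.0, 0.0, 0.0]
    return [(a - b) / denom for a, b in zip(p, q)]


def si_gradient_loss(pred, gt, steps=(1, 2)):
    """Mean L1 gap between normalized pairwise differences of two point maps.

    pred, gt: nested lists of shape [H][W][3].
    steps: pixel offsets used to form horizontal and vertical pairs.
    """
    h, w = len(pred), len(pred[0])
    total, count = 0.0, 0
    for s in steps:
        for i in range(h):
            for j in range(w):
                for di, dj in ((0, s), (s, 0)):
                    a, b = i + di, j + dj
                    if a >= h or b >= w:
                        continue  # pair would leave the image
                    g_pred = normalized_difference(pred[a][b], pred[i][j])
                    g_gt = normalized_difference(gt[a][b], gt[i][j])
                    total += sum(abs(x - y) for x, y in zip(g_pred, g_gt))
                    count += 1
    return total / count if count else 0.0
```

Under this assumed form, a prediction that equals the ground truth up to a global positive scale incurs no penalty, while local shape errors between neighboring points still do.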
For evaluation, we suggest a point map normal mean angular error as a complementary metric alongside global and local AbsRel. We compute normals from neighboring predicted 3D points and report the angular difference to the GT. Empirically, this matches our qualitative impression better.
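As a rough illustration of such a metric, the sketch below estimates a normal at each pixel from the cross product of the forward differences to its right and lower neighbors, then averages the angle between predicted and GT normals. The helper names, the forward-difference stencil, and the choice to skip degenerate pixels are assumptions made for this example. The actual evaluation protocol (masking, neighborhood choice, alignment) may differ.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _unit(a):
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a] if n > 0.0 else None


def normals_from_points(points):
    """Unit normals of shape [H-1][W-1] from a [H][W][3] point map.

    Each normal is the cross product of the forward differences along the
    image x and y directions. Degenerate pixels yield None.
    """
    h, w = len(points), len(points[0])
    normals = []
    for i in range(h - 1):
        row = []
        for j in range(w - 1):
            dx = _sub(points[i][j + 1], points[i][j])
            dy = _sub(points[i + 1][j], points[i][j])
            row.append(_unit(_cross(dx, dy)))
        normals.append(row)
    return normals


def mean_angular_error_deg(pred, gt):
    """Mean angle in degrees between normals of two point maps."""
    errors = []
    for row_p, row_g in zip(normals_from_points(pred), normals_from_points(gt)):
        for n_p, n_g in zip(row_p, row_g):
            if n_p is None or n_g is None:
                continue  # skip degenerate neighbourhoods
            dot = sum(a * b for a, b in zip(n_p, n_g))
            dot = max(-1.0, min(1.0, dot))  # guard acos against rounding
            errors.append(math.degrees(math.acos(dot)))
    return sum(errors) / len(errors) if errors else float("nan")
```

Because the normals depend only on differences between neighboring points, a global scale or offset error leaves them unchanged, while local bending and oscillation show up directly as angular error.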
On zero-shot monocular geometry benchmarks, SurGe gets the best average rank for global point map AbsRel among SotA methods. More importantly, it improves local point map and point map normal metrics, suggesting better local surface geometry. This matches what we see qualitatively.



