World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible
Summary
World Tracing introduces a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing occluded surfaces. It uses a diffusion transformer trained with pixel-space flow matching, achieving strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks.
Cached at: 06/15/26, 04:59 PM
Source: https://huggingface.co/papers/2606.13652
Abstract
World Tracing introduces a generative pixel-aligned geometry representation that predicts 3D points aligned with input pixels while completing hidden surfaces, using a diffusion transformer trained with pixel-space flow matching.
Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
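To make the layered representation concrete, here is a minimal NumPy sketch, not the authors' code. It builds the kind of target the abstract describes: for each pixel of a pinhole camera, an ordered front-to-back stack of camera-space points, using a single analytic sphere as a toy scene. It then forms a training pair for flow matching with the common linear interpolation path. The function names (`pixel_rays`, `trace_sphere_layers`, `flow_matching_pair`), the camera model, the sphere scene, and the linear path are all assumptions for illustration; the abstract does not specify WT-DiT's architecture, its exact flow-matching formulation, or its mixed noise schedule, and none of those are reproduced here.

```python
import numpy as np

def pixel_rays(h, w, focal):
    """Camera rays (x, y, 1) through each pixel centre of a pinhole camera (assumed model)."""
    u = np.arange(w) + 0.5 - w / 2.0
    v = np.arange(h) + 0.5 - h / 2.0
    uu, vv = np.meshgrid(u, v)  # both (h, w)
    return np.stack([uu / focal, vv / focal, np.ones((h, w))], axis=-1)

def trace_sphere_layers(rays, center, radius):
    """Return (2, h, w, 3) camera-space points for a toy sphere scene.

    Layer 0 is the visible (near) intersection, layer 1 the occluded (far) one.
    Pixels whose ray misses the sphere are filled with NaN.
    """
    # Solve |t * d - c|^2 = r^2 for t along each ray d starting at the camera origin.
    a = np.sum(rays * rays, axis=-1)
    b = -2.0 * (rays @ center)
    c = center @ center - radius ** 2
    disc = b * b - 4.0 * a * c
    hit = disc >= 0
    root = np.sqrt(np.where(hit, disc, 0.0))
    t_near = (-b - root) / (2.0 * a)
    t_far = (-b + root) / (2.0 * a)
    layers = np.stack([t_near, t_far])[..., None] * rays  # (2, h, w, 3)
    layers[:, ~hit] = np.nan
    return layers

def flow_matching_pair(x1, t, rng):
    """Linear-path flow matching (assumed form): noisy sample x_t and velocity target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

# Toy usage: a 5x5 image looking at a unit sphere 5 units in front of the camera.
rays = pixel_rays(5, 5, focal=5.0)
layers = trace_sphere_layers(rays, center=np.array([0.0, 0.0, 5.0]), radius=1.0)
hit = ~np.isnan(layers[0, ..., 0])

# The centre pixel sees the sphere at depth 4 (visible) and depth 6 (hidden back face).
print(layers[:, 2, 2, 2])

# Build one flow-matching training pair from the valid pixels only.
rng = np.random.default_rng(0)
t = 0.3
x1 = layers[:, hit]  # (2 layers, num_hit_pixels, 3)
x_t, v = flow_matching_pair(x1, t, rng)
```

Because every layer is a scaled copy of the same pixel ray, each point in the stack reprojects to the pixel it was predicted for, which is the 2D-to-3D correspondence the abstract highlights. A real scene would need more than two layers and a general ray tracer rather than a closed-form sphere intersection.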
Models citing this paper: 4
#### haoz19/object-model-6layer Image-to-3D • Updated 4 days ago • 8
#### haoz19/scene-model-6layer Image-to-3D • Updated 4 days ago • 5
#### haoz19/scene-model-6layer-840 Image-to-3D • Updated 1 day ago • 4
#### haoz19/dynamic-model-16frame Image-to-3D • Updated 4 days ago • 3
Similar Articles
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D introduces a pixel-aligned 3D generation approach that improves fidelity by establishing direct pixel-to-3D correspondences through back-projection conditioning, addressing issues in canonical space generation.
SurGe: Improved Surface Geometry in Point Maps
SurGe introduces a Neighborhood Attention Decoder and a reformulated scale-invariant gradient matching loss to improve local surface geometry accuracy in feedforward 3D reconstruction, particularly for thin structures. It achieves state-of-the-art average rank on zero-shot monocular geometry benchmarks, with better local point map and normal metrics.
World Machine: Towards Generative World Modeling for Time-Series
World Machine proposes a transformer-based generative world modeling architecture for time series that uses latent states to adapt to varying context lengths, addressing the quadratic memory cost of traditional transformers. Experiments on a synthetic dataset validate its feasibility and show improvements over conventional transformers.
Geometry-Aware Image Flow Matching
This paper introduces geometry-aware flow matching for natural images by treating them as points on a hypersphere, proposing SOT-CFM and SFM methods that improve generative modeling by leveraging the spherical structure of image data.
TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction
TriSplat is a feed-forward 3D reconstruction network that uses oriented triangle primitives to directly generate simulation-ready meshes from single images, bypassing expensive post-processing steps. It achieves geometry-faithful reconstructions while maintaining competitive novel-view rendering quality.