Geometric Context Transformer for Streaming 3D Reconstruction

Papers with Code Trending 04/15/26, 12:00 AM Papers

3d-reconstruction streaming-3d transformer slam computer-vision foundation-model

Summary

Introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction using a geometric context transformer architecture that achieves stable real-time performance at 20 FPS.

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
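The abstract names three context components but does not describe their mechanics. As a rough illustration only, the Python sketch below shows one way a streaming state could stay bounded while still covering the start of the sequence, the recent past, and a thinned-out history. Every name here (`StreamingContext`, `anchor_size`, `window_size`, `memory_size`) and the halving eviction rule are assumptions made for this example, not LingBot-Map's implementation; the real model presumably attends over learned tokens rather than frame indices.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Hypothetical bounded context for a streaming model (illustrative only)."""

    anchor_size: int = 2   # first frames, kept forever to fix the coordinate frame
    window_size: int = 8   # most recent frames, kept in full
    memory_size: int = 32  # cap on retained entries for older frames
    anchor: list = field(default_factory=list)
    window: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def push(self, frame_id: int) -> None:
        # The earliest frames become the anchor and are never evicted.
        if len(self.anchor) < self.anchor_size:
            self.anchor.append(frame_id)
            return
        # Later frames enter the sliding window of recent frames.
        self.window.append(frame_id)
        if len(self.window) > self.window_size:
            # A frame leaving the window moves into long-range memory.
            self.memory.append(self.window.popleft())
            if len(self.memory) > self.memory_size:
                # Assumed rule: keep every other entry, so older parts of
                # the trajectory are covered more coarsely over time.
                self.memory = self.memory[::2]

    def context(self) -> list:
        # Frames a new frame would attend to under this sketch.
        return self.anchor + self.memory + list(self.window)
```

With these default sizes the context never exceeds 2 + 32 + 8 = 42 entries, however many frames are pushed. For scale, 10,000 frames at the reported rate of about 20 FPS corresponds to roughly 500 seconds of processing.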

Original Article

Cached at: 05/08/26, 08:43 AM

Source: https://huggingface.co/papers/2604.14141

Abstract

LingBot-Map is a feed-forward 3D foundation model that reconstructs scenes from video streams using a geometric context transformer architecture with specialized attention mechanisms for coordinate grounding, dense geometric cues, and long-range drift correction, achieving stable real-time performance at 20 FPS.

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.

Get this paper in your agent:

hf papers read 2604.14141

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash