Geometric Context Transformer for Streaming 3D Reconstruction
Summary
Introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction using a geometric context transformer architecture that achieves stable real-time performance at 20 FPS.
Cached at: 05/08/26, 08:43 AM
Source: https://huggingface.co/papers/2604.14141
Abstract
LingBot-Map is a feed-forward 3D foundation model that reconstructs scenes from video streams using a geometric context transformer architecture with specialized attention mechanisms for coordinate grounding, dense geometric cues, and long-range drift correction, achieving stable real-time performance at 20 FPS.
Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
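The abstract names three context components but gives no implementation details here, so the following is only a toy sketch of how such a bounded streaming state might be tracked. It is not the LingBot-Map code. The class `StreamingContext`, the function `simulate`, and every parameter value (`num_anchor`, `window_size`, `tokens_per_frame`, and one summary token per older frame in the trajectory memory) are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Toy bookkeeping for a three-part streaming context (illustrative only)."""

    num_anchor: int = 2         # assumed: first frames kept forever to fix the coordinate frame
    window_size: int = 4        # assumed: most recent frames kept with dense tokens
    tokens_per_frame: int = 16  # assumed: dense token count per retained frame
    anchor: list = field(default_factory=list)
    window: deque = field(default_factory=deque)
    trajectory: list = field(default_factory=list)

    def push(self, frame_id: int) -> None:
        """Route a new frame into the anchor set or the sliding window."""
        if len(self.anchor) < self.num_anchor:
            self.anchor.append(frame_id)
            return
        self.window.append(frame_id)
        if len(self.window) > self.window_size:
            # A frame leaving the window is reduced to a single summary
            # entry in the trajectory memory (assumption of this sketch).
            self.trajectory.append(self.window.popleft())

    def context_tokens(self) -> int:
        """Tokens the next frame would attend to under this toy scheme."""
        dense = (len(self.anchor) + len(self.window)) * self.tokens_per_frame
        return dense + len(self.trajectory)


def simulate(num_frames: int) -> StreamingContext:
    """Feed frame ids 0..num_frames-1 through a default context."""
    ctx = StreamingContext()
    for frame_id in range(num_frames):
        ctx.push(frame_id)
    return ctx


if __name__ == "__main__":
    ctx = simulate(10_000)
    full_cache = 10_000 * ctx.tokens_per_frame
    print(ctx.context_tokens(), "tokens vs", full_cache, "for a full per-frame cache")
```

Under these assumed sizes, only the anchor and window frames keep dense tokens, so the context grows by one entry per frame rather than by a full frame of tokens. The actual token budgets and memory layout of the model may differ substantially.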
Models citing this paper: 3
#### robbyant/lingbot-map
#### agramoi/lingbot-map
#### maujim/lingbot-map-long-only
Similar Articles
robbyant/lingbot-map
LingBot-Map is a feed-forward 3D foundation model for streaming 3D reconstruction that uses a Geometric Context Transformer architecture, achieving state-of-the-art performance with efficient ~20 FPS inference on long sequences exceeding 10,000 frames.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
@FinanceYF5: This AI is impressive. LingBot-Map can convert real-time video streams into real-time 3D reconstruction. 20 FPS code + model
LingBot-Map is an AI model capable of converting real-time video streams into real-time 3D reconstruction, running at 20 FPS with complete code and model provided.
D4RT: Teaching AI to see the world in four dimensions
DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.
TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos
This paper introduces TT4D, a novel pipeline and large-scale dataset for reconstructing table tennis gameplay in 4D from monocular videos. It features a unique lift-first approach that estimates 3D ball trajectories and spin before time segmentation, enabling robust reconstruction even with occlusions.