Geometric Context Transformer for Streaming 3D Reconstruction
Summary
Introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction using a geometric context transformer architecture that achieves stable real-time performance at 20 FPS.
Cached at: 05/08/26, 08:43 AM
Source: https://huggingface.co/papers/2604.14141
Abstract
LingBot-Map is a feed-forward 3D foundation model that reconstructs scenes from video streams using a geometric context transformer architecture with specialized attention mechanisms for coordinate grounding, dense geometric cues, and long-range drift correction, achieving stable real-time performance at 20 FPS.
Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
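The abstract names three context components but gives no implementation details. The sketch below is only an illustration of how such a bounded streaming state could be organized, not LingBot-Map's actual code. Everything in it is an assumption made for the example: the class name `StreamingContext`, the sizes `num_anchor_frames` and `window_size`, and the choice to keep one summary token per older frame.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Hypothetical bounded context for a streaming model (illustration only)."""
    num_anchor_frames: int = 2   # assumed: earliest frames fix the coordinate frame
    window_size: int = 4         # assumed: recent frames kept with dense tokens
    anchors: list = field(default_factory=list)     # (frame_id, dense_tokens)
    window: deque = field(default_factory=deque)    # (frame_id, dense_tokens, summary)
    trajectory: list = field(default_factory=list)  # (frame_id, summary)

    def push(self, frame_id, dense_tokens, summary_token):
        # The first frames become the anchor context and are never evicted.
        if len(self.anchors) < self.num_anchor_frames:
            self.anchors.append((frame_id, dense_tokens))
            return
        # Later frames enter the sliding window with their dense tokens.
        self.window.append((frame_id, dense_tokens, summary_token))
        if len(self.window) > self.window_size:
            # An evicted frame keeps only its summary token in the trajectory memory.
            old_id, _, old_summary = self.window.popleft()
            self.trajectory.append((old_id, old_summary))

    def context_size(self):
        # Number of tokens the next frame would attend to.
        dense = sum(len(tokens) for _, tokens in self.anchors)
        dense += sum(len(tokens) for _, tokens, _ in self.window)
        return dense + len(self.trajectory)


if __name__ == "__main__":
    ctx = StreamingContext()
    for frame_id in range(10):
        ctx.push(frame_id, [0.0] * 100, 0.0)  # 100 placeholder dense tokens per frame
    print(ctx.context_size())
```

With these assumed sizes, ten frames of 100 tokens each leave 604 tokens in context (200 anchor, 400 window, 4 trajectory) rather than 1,000, and each further frame adds only one token, which is one way a streaming state could stay compact over long sequences.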
Get this paper in your agent:
hf papers read 2604.14141
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 3
#### robbyant/lingbot-map Updated 12 days ago • 195
#### agramoi/lingbot-map
#### maujim/lingbot-map-long-only Updated 4 days ago
Datasets citing this paper: 0
Spaces citing this paper: 5
Collections including this paper: 1
Similar Articles
robbyant/lingbot-map
LingBot-Map is a feed-forward 3D foundation model for streaming 3D reconstruction that uses a Geometric Context Transformer architecture, achieving state-of-the-art performance with efficient ~20 FPS inference on long sequences exceeding 10,000 frames.
@IlirAliu_: Forget lidar. One single camera. Runs in real time & is open source: A streaming 3D model that reconstructs scenes live…
LingBot-Map is an open-source, real-time streaming 3D reconstruction model that uses a single camera, running at ~20 FPS via a feed-forward geometric context transformer, outperforming both streaming and offline methods.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors
Stream3D-VLM is an online 3D vision-language model that enables real-time spatial understanding from streaming video by incrementally integrating geometry priors and using geometry-adaptive voxel compression, outperforming existing models on 3D spatial understanding tasks.
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.