@IlirAliu_: Forget lidar. One single camera. Runs in real time & is open source: A streaming 3D model that reconstructs scenes live…
Summary
LingBot-Map is an open-source, real-time streaming 3D reconstruction model that uses a single camera. It runs at ~20 FPS using a feed-forward Geometric Context Transformer and is reported to outperform existing streaming methods as well as some offline, optimization-based ones.
Cached at: 06/28/26, 06:02 AM
Forget lidar. One single camera. Runs in real time & is open source:
A streaming 3D model that reconstructs scenes live, at ~20 FPS, over long sequences.
End-to-end.
Optimization tricks, cleanup steps?
Nope.
And it beats both streaming and even some offline methods.
Perception is becoming software-first.
Closer to machines that see and understand the world as it unfolds.
Thanks for sharing, @YinghaoXu1
Models: https://huggingface.co/robbyant/lingbot-map… Project page: https://technology.robbyant.com/lingbot-map Code: https://github.com/Robbyant/lingbot-map… Paper: https://arxiv.org/abs/2604.14141
robbyant/lingbot-map · Hugging Face
Source: https://huggingface.co/robbyant/lingbot-map

LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction
Robbyant Team
https://github.com/user-attachments/assets/fe39e095-af2c-4ec9-b68d-a8ba97e505ab
🗺️ Meet LingBot-Map! We’ve built a feed-forward 3D foundation model for streaming 3D reconstruction! 🏗️🌍
LingBot-Map focuses on:
- Geometric Context Transformer: Architecturally unifies coordinate grounding, dense geometric cues, and long-range drift correction within a single streaming framework through anchor context, pose-reference window, and trajectory memory.
- High-Efficiency Streaming Inference: A feed-forward architecture with paged KV cache attention, enabling stable inference at ~20 FPS on 518×378 resolution over long sequences exceeding 10,000 frames.
- State-of-the-Art Reconstruction: Superior performance on diverse benchmarks compared to both existing streaming and iterative optimization-based approaches.
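To put the throughput figures above in perspective, the short calculation below converts the quoted frame rate and sequence length into wall-clock time and estimates the number of patch tokens per frame. It is a rough sketch: the 14-pixel patch size is an assumption (it is common for ViT backbones, and both 518 and 378 are multiples of 14), not a value stated in the source.

```python
def stream_duration_seconds(num_frames: int, fps: float) -> float:
    """Wall-clock time needed to process num_frames at a steady frame rate."""
    return num_frames / fps


def patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of non-overlapping patch tokens per frame.

    patch=14 is an ASSUMPTION for illustration, not a documented model setting.
    """
    return (width // patch) * (height // patch)


duration = stream_duration_seconds(10_000, 20.0)  # 10,000 frames at ~20 FPS
tokens = patch_tokens(518, 378)                   # 37 * 27 patches
print(f"{duration:.0f} s (~{duration / 60:.1f} min), {tokens} tokens/frame")
# prints "500 s (~8.3 min), 999 tokens/frame"
```

Under these assumptions, a 10,000-frame sequence corresponds to a little over eight minutes of continuous streaming at the quoted rate.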
⚙️ Quick Start
Installation
1. Create conda environment
conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map
2. Install PyTorch (CUDA 12.8)
pip install torch==2.9.1 torchvision==0.24.1 --index-url https://download.pytorch.org/whl/cu128
For other CUDA versions, see PyTorch Get Started.
3. Install lingbot-map
pip install -e .
4. Install FlashInfer (recommended)
FlashInfer provides paged KV cache attention for efficient streaming inference:
# CUDA 12.8 + PyTorch 2.9
pip install flashinfer-python -i https://flashinfer.ai/whl/cu128/torch2.9/
For other CUDA/PyTorch combinations, see FlashInfer installation. If FlashInfer is not installed, the model falls back to SDPA (PyTorch native attention) via --use_sdpa.
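A launcher script could decide on the attention backend before calling demo.py. The sketch below is a minimal illustration, assuming the flashinfer-python package is importable as a top-level `flashinfer` module; the helper names (`has_flashinfer`, `attention_flags`, `build_command`) are hypothetical and not part of the repository. Because the text does not make clear whether the fallback is automatic, the sketch passes the flag explicitly.

```python
import importlib.util


def has_flashinfer() -> bool:
    """True if a top-level `flashinfer` module can be found (assumed module name)."""
    return importlib.util.find_spec("flashinfer") is not None


def attention_flags(flashinfer_available: bool) -> list[str]:
    """Extra demo.py flags: request the SDPA fallback when FlashInfer is missing."""
    return [] if flashinfer_available else ["--use_sdpa"]


def build_command(model_path: str, image_folder: str) -> list[str]:
    """Assemble a demo.py invocation using only flags shown in this README."""
    cmd = ["python", "demo.py",
           "--model_path", model_path,
           "--image_folder", image_folder]
    return cmd + attention_flags(has_flashinfer())


print(" ".join(build_command("/path/to/checkpoint.pt", "/path/to/images/")))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues with paths that contain spaces.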
5. Visualization dependencies (optional)
pip install -e ".[vis]"
📦 Model Download
🎬 Demo
Streaming Inference from Images
python demo.py --model_path /path/to/checkpoint.pt \
--image_folder /path/to/images/
Streaming Inference from Video
python demo.py --model_path /path/to/checkpoint.pt \
--video_path video.mp4 --fps 10
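The --fps flag implies that the video is subsampled before inference. How demo.py does this is not described here, so the following is only a sketch of one plausible scheme: keep evenly spaced source frames so that the kept frames approximate the target rate. The function name and the "never upsample" rule are assumptions.

```python
def sampled_frame_indices(num_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Indices of source frames to keep so the output approximates target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # A step below 1 would mean upsampling; clamp so every frame is used at most once.
    step = max(source_fps / target_fps, 1.0)
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# A 30 FPS clip sampled at 10 FPS keeps every third frame.
print(sampled_frame_indices(9, 30.0, 10.0))  # prints [0, 3, 6]
```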
Streaming with Keyframe Interval
Use --keyframe_interval to reduce KV cache memory by keeping only every N-th frame as a keyframe. Non-keyframes still produce predictions but are not stored in the cache. This is useful for long sequences that exceed 320 frames.
python demo.py --model_path /path/to/checkpoint.pt \
--image_folder /path/to/images/ --keyframe_interval 6
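The memory effect of the keyframe interval can be estimated with a few lines. This sketch assumes that frames 0, N, 2N, ... are the ones kept as keyframes; the exact selection rule inside the model is not stated in the source, and both function names are illustrative.

```python
def is_keyframe(frame_index: int, keyframe_interval: int) -> bool:
    """Assumed rule: every N-th frame, starting at frame 0, is a keyframe."""
    return frame_index % keyframe_interval == 0


def cached_frame_count(num_frames: int, keyframe_interval: int) -> int:
    """Number of frames whose keys/values would remain in the cache."""
    return sum(is_keyframe(i, keyframe_interval) for i in range(num_frames))


print(cached_frame_count(1000, 1))  # prints 1000 (every frame cached)
print(cached_frame_count(1000, 6))  # prints 167 (roughly one sixth)
```

Every frame still yields a prediction in this scheme; only the number of frames retained as context shrinks.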
Windowed Inference (for long sequences, >3000 frames)
python demo.py --model_path /path/to/checkpoint.pt \
--video_path video.mp4 --fps 10 \
--mode windowed --window_size 64
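As a rough mental model of windowed processing, a long sequence can be split into fixed-size chunks of frames. The sketch below shows that partitioning only; it is not the repository's implementation. The `overlap` parameter is hypothetical and does not correspond to a documented demo.py flag.

```python
def window_ranges(num_frames: int, window_size: int, overlap: int = 0) -> list[tuple[int, int]]:
    """Split a sequence into half-open [start, end) frame ranges.

    `overlap` (frames shared by consecutive windows) is a hypothetical knob.
    """
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    stride = window_size - overlap
    ranges = []
    start = 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break  # the final window reached the end of the sequence
        start += stride
    return ranges


print(window_ranges(150, 64))  # prints [(0, 64), (64, 128), (128, 150)]
```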
Sky Masking
Sky masking uses an ONNX sky segmentation model to filter out sky points from the reconstructed point cloud, which improves visualization quality for outdoor scenes.
Setup:
# Install onnxruntime (required)
pip install onnxruntime # CPU
# or
pip install onnxruntime-gpu # GPU (faster for large image sets)
The sky segmentation model (skyseg.onnx) will be automatically downloaded from HuggingFace on first use.
Usage:
python demo.py --model_path /path/to/checkpoint.pt \
--image_folder /path/to/images/ --mask_sky
Sky masks are cached in <image_folder>_sky_masks/ so subsequent runs skip regeneration.
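To check whether a previous run has already produced masks, the cache location can be derived from the image folder path. This is a small sketch based only on the naming pattern given above; the helper names are illustrative, and the layout of files inside the cache folder is not assumed.

```python
from pathlib import Path


def sky_mask_cache_dir(image_folder: str) -> Path:
    """Sibling directory named <image_folder>_sky_masks, per the pattern above."""
    folder = Path(image_folder)  # Path drops any trailing slash
    return folder.with_name(folder.name + "_sky_masks")


def has_cached_masks(image_folder: str) -> bool:
    """True if the cache directory exists and contains at least one entry."""
    cache = sky_mask_cache_dir(image_folder)
    return cache.is_dir() and any(cache.iterdir())


print(sky_mask_cache_dir("/path/to/images/"))
```

Deleting that directory should force the masks to be regenerated on the next run with --mask_sky, assuming the caching behaves as described.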
Without FlashInfer (SDPA fallback)
python demo.py --model_path /path/to/checkpoint.pt \
--image_folder /path/to/images/ --use_sdpa
📜 License
This project is released under the Apache License 2.0. See the LICENSE file for details.
📖 Citation
@article{chen2026geometric,
title={Geometric Context Transformer for Streaming 3D Reconstruction},
author={Chen, Lin-Zhuo and Gao, Jian and Chen, Yihang and Cheng, Ka Leong and Sun, Yipengjing and Hu, Liangxiao and Xue, Nan and Zhu, Xing and Shen, Yujun and Yao, Yao and Xu, Yinghao},
journal={arXiv preprint arXiv:2604.14141},
year={2026}
}
✨ Acknowledgments
We thank Shangzhan Zhang, Jianyuan Wang, Yudong Jin, Christian Rupprecht, and Xun Cao for their helpful discussions and support.
This work builds upon several excellent open-source projects: