Odyseus - Spatial VLM: Projecting 2D reasoning into 3D outputs (open source repo)

Reddit r/ArtificialInteligence 05/11/26, 06:22 AM Tools

spatial-vlm physical-ai monocular-depth-estimation open-source robotics 3d-coordinates

Summary

Odyseus is an open-source Spatial VLM tool that combines Qwen with Depth Anything to project 2D visual reasoning into actionable 3D coordinates for robotics and physical AI applications.

I've always argued that Physical AI for robotics needs actionable outputs like 3D coordinates, not bullet points or nice paragraphs. So I decided to experiment by combining a VLM with monocular depth estimation, essentially projecting 2D reasoning into 3D. I called it Odyseus - Spatial VLM.

Tech stack:

- VLM: Qwen 3.6
- Depth Estimation: Depth Anything 3 - Metric Large

It worked pretty well, so I figured I'd share it. Check out the repo: [https://github.com/MercuriusTech/Odyseus-Spatial-VLM](https://github.com/MercuriusTech/Odyseus-Spatial-VLM)
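For readers wondering what "projecting 2D reasoning into 3D" can look like in practice, here is a minimal sketch of one common approach: take a pixel the VLM points at, read the metric depth at that pixel, and back-project through a pinhole camera model. This is not taken from the Odyseus repo, which may do it differently. The names `Intrinsics`, `backproject`, and `point_from_vlm` are hypothetical, the intrinsics values are made up, and the depth is assumed to be distance along the optical axis in metres.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (illustrative, not from the repo)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u, v, depth_m, k):
    """Map pixel (u, v) with metric depth to camera-frame (X, Y, Z) in metres.

    Assumes depth_m is measured along the optical axis (z-depth).
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def point_from_vlm(pixel, depth_map, k):
    """Look up depth at a VLM-predicted pixel and back-project it.

    depth_map is a list of rows, indexed as depth_map[row][col], in metres.
    """
    u, v = pixel
    col, row = int(round(u)), int(round(v))
    if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
        raise IndexError("pixel outside depth map")
    return backproject(u, v, depth_map[row][col], k)


if __name__ == "__main__":
    # Toy 4x4 depth map where everything is 2 m away.
    k = Intrinsics(fx=100.0, fy=100.0, cx=2.0, cy=2.0)
    depth_map = [[2.0] * 4 for _ in range(4)]
    # Pretend the VLM answered "the object is at pixel (3, 1)".
    print(point_from_vlm((3, 1), depth_map, k))
```

In a real setup the pixel would come from the VLM's answer (a point or the centre of a bounding box), the depth map from the depth model, and the intrinsics from camera calibration. The resulting point is in the camera frame, so a robot would still need a camera-to-base transform before acting on it.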

Similar Articles

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Hugging Face Daily Papers

SpatialAct is a new simulator-grounded benchmark that probes whether VLM agents can perform coherent spatial reasoning and translate it into actions in 3D environments across multi-turn feedback settings. Experiments reveal a significant reasoning-to-action gap, with current VLMs struggling to maintain spatial beliefs and produce reliable actions despite performing well on isolated reasoning tasks.

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Hugging Face Daily Papers

Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Hugging Face Daily Papers

This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Hugging Face Daily Papers

Stream3D-VLM is an online 3D vision-language model that enables real-time spatial understanding from streaming video by incrementally integrating geometry priors and using geometry-adaptive voxel compression, outperforming existing models on 3D spatial understanding tasks.

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Hugging Face Daily Papers

A training-free framework for spatial reasoning from egocentric videos that enables revisiting conclusions through synthesized novel-view videos generated from predicted 3D geometry.
