Odyseus - Spatial VLM : Projecting 2D reasoning into 3D outputs (open source repo)
Summary
Odyseus is an open-source Spatial VLM tool that combines Qwen with Depth Anything to project 2D visual reasoning into actionable 3D coordinates for robotics and physical AI applications.
Similar Articles
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
SpatialAct is a new simulator-grounded benchmark that probes whether VLM agents can perform coherent spatial reasoning and translate it into actions in 3D environments across multi-turn feedback settings. Experiments reveal a significant reasoning-to-action gap, with current VLMs struggling to maintain spatial beliefs and produce reliable actions despite performing well on isolated reasoning tasks.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors
Stream3D-VLM is an online 3D vision-language model that enables real-time spatial understanding from streaming video by incrementally integrating geometry priors and using geometry-adaptive voxel compression, outperforming existing models on 3D spatial understanding tasks.
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
A training-free framework for spatial reasoning from egocentric videos that enables revisiting conclusions through synthesized novel-view videos generated from predicted 3D geometry.