Tag
Skill-3D is a framework that enables AI agents to learn scene-aware skills through self-evolving memory and skill libraries, significantly improving tool utilization in 3D spatial reasoning tasks (e.g., from 39% to 78% on VSI-Bench).
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.