Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning
Summary
Skill-3D is a framework that enables AI agents to learn scene-aware skills through self-evolving memory and skill libraries, significantly improving tool utilization in 3D spatial reasoning tasks (e.g., from 39% to 78% on VSI-Bench).
Cached at: 06/10/26, 12:13 AM
Source: https://huggingface.co/papers/2606.07436
Abstract
Skill-3D framework enables agents to learn scene-aware skills through self-evolving memory and skill libraries, improving tool utilization in 3D spatial reasoning tasks.
This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent’s tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
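The memory-and-skill loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`Trajectory`, `Skill`, `SceneMemory`), the scene label, and the tool names are all hypothetical. It also makes two simplifying assumptions: "similar scenes" are matched by an exact label, and "distillation" is reduced to picking the most frequent successful tool sequence, where the paper presumably uses something richer.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One recorded episode (hypothetical structure)."""
    scene: str      # scene label; exact match stands in for scene similarity
    tools: tuple    # ordered names of the tools the agent called
    success: bool   # whether the final answer was correct


@dataclass
class Skill:
    """A scene-specific hint distilled from past trajectories."""
    scene: str
    recommended_tools: tuple = ()
    lessons: list = field(default_factory=list)  # tool sequences that failed

    def as_prompt(self) -> str:
        """Render the skill as text that could be injected into an agent prompt."""
        lines = [f"Scene: {self.scene}"]
        if self.recommended_tools:
            lines.append("Suggested tools: " + " -> ".join(self.recommended_tools))
        for failed in self.lessons:
            lines.append("Avoid: " + " -> ".join(failed))
        return "\n".join(lines)


class SceneMemory:
    """Stores trajectories per scene and re-distills the skill on every update."""

    def __init__(self):
        self._by_scene = defaultdict(list)
        self.skills = {}

    def record(self, traj: Trajectory) -> None:
        # Each new trajectory refines the skill for its scene, so the
        # memory and the skill library change together.
        self._by_scene[traj.scene].append(traj)
        self.skills[traj.scene] = self._distill(traj.scene)

    def _distill(self, scene: str) -> Skill:
        trajs = self._by_scene[scene]
        # Assumption: the most frequent successful tool sequence is the skill.
        wins = Counter(t.tools for t in trajs if t.success)
        best = wins.most_common(1)[0][0] if wins else ()
        # Failed sequences are attached as lessons, without duplicates.
        lessons = []
        for t in trajs:
            if not t.success and t.tools not in lessons:
                lessons.append(t.tools)
        return Skill(scene, best, lessons)

    def retrieve(self, scene: str):
        """Return the skill for a recurring scene, or None if the scene is new."""
        return self.skills.get(scene)


# Usage with made-up tool names: one success and one failure in the same scene.
memory = SceneMemory()
memory.record(Trajectory("indoor-room", ("depth_estimator", "object_detector"), True))
memory.record(Trajectory("indoor-room", ("object_detector",), False))
print(memory.retrieve("indoor-room").as_prompt())
```

Re-distilling on every `record` call keeps the sketch short; a real system would more plausibly batch updates and use a learned or embedding-based notion of scene similarity.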
Similar Articles
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
Introduces SkillDAG, a self-evolving typed directed graph for LLM skill selection at scale that models inter-skill relationships and allows agents to query and evolve the graph during execution, outperforming baselines on ALFWorld and SkillsBench.
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
SkillOpt introduces a systematic text-space optimizer for agent skills that trains skills as external agent state with stable updates and zero deployment inference overhead, achieving superior performance across multiple benchmarks and execution environments.
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph is a framework that represents reusable skills as nodes in a directed graph to enable large language model agents to handle compositional tasks more effectively through structured skill retrieval and continuous evolution.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 is a unified framework that trains a single policy to co-evolve skill selection, utilization, and distillation using a shared task-outcome objective. Experiments on ALFWorld and WebShop show it outperforms existing baselines in complex task environments.