
Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

June 5, 2026
Authors: Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang
cs.AI

Abstract

This paper explores agentic 3D spatial understanding, in which multimodal large language model (MLLM) agents perform 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences in 3D scenarios, so the agentic paradigm yields only marginal gains over non-agentic strategies. We show that 3D spatial reasoning tasks are heterogeneous across scenes, yet existing agents apply a uniform tool-use strategy to every scene instead of selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving, scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory in a Scene Memory. Successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, and failed trajectories are attached to that skill as lessons. During training, whenever a similar scene recurs, the corresponding skill is injected to guide the agent. The resulting trajectories, successful or not, further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training on skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
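
The record, distill, and inject loop described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: every name here (`Trajectory`, `Skill`, `SkillLibrary`, the scene labels, and the tool names) is hypothetical. It assumes that scenes are matched by an exact string label rather than by a learned similarity, and that "distillation" simply keeps the most frequent successful tool sequence.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene: str        # hypothetical scene label, e.g. "room-size"
    tools: list[str]  # ordered tool calls made by the agent
    success: bool     # whether the final answer was correct


@dataclass
class Skill:
    scene: str
    steps: list[str] = field(default_factory=list)    # distilled from successes
    lessons: list[str] = field(default_factory=list)  # derived from failures


class SkillLibrary:
    """Toy Scene Memory plus skill library that are updated together."""

    def __init__(self):
        self.memory = defaultdict(list)  # scene label -> recorded trajectories
        self.skills = {}                 # scene label -> current Skill

    def record(self, traj):
        # Store the trajectory, then re-distill the skill for that scene.
        self.memory[traj.scene].append(traj)
        self._refine(traj.scene)

    def _refine(self, scene):
        trajs = self.memory[scene]
        wins = [t for t in trajs if t.success]
        if not wins:
            return  # no success yet, so there is nothing to distill
        # Assumed stand-in for distillation: most frequent successful sequence.
        best = Counter(tuple(t.tools) for t in wins).most_common(1)[0][0]
        # Failed tool sequences are attached to the skill as lessons.
        lessons = sorted(
            {"avoid: " + " -> ".join(t.tools) for t in trajs if not t.success}
        )
        self.skills[scene] = Skill(scene, list(best), lessons)

    def inject(self, scene):
        # Return the skill for a recurring scene, or None if none exists yet.
        return self.skills.get(scene)


if __name__ == "__main__":
    lib = SkillLibrary()
    lib.record(Trajectory("room-size", ["depth", "measure"], True))
    lib.record(Trajectory("room-size", ["caption"], False))
    lib.record(Trajectory("room-size", ["depth", "measure"], True))
    print(lib.inject("room-size"))
```

In this sketch every new trajectory triggers a re-distillation, which mirrors the co-evolution of memory and skills only loosely. A real system would presumably use an LLM to summarize trajectories and a similarity measure to group scenes, neither of which the abstract specifies.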