Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

June 5, 2026
Authors: Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang
cs.AI

Abstract

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
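
The memory-and-skill loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the string scene labels, the tool names, and the "most frequent successful tool sequence" distillation rule are all assumptions made for illustration. The abstract does not specify how scenes are identified or how trajectories are distilled.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene: str      # hypothetical scene label; scene identification is assumed done upstream
    tools: tuple    # ordered tool calls made by the agent (tool names are hypothetical)
    success: bool   # whether the trajectory produced a correct answer


@dataclass
class Skill:
    scene: str
    recommended_tools: tuple = ()
    lessons: list = field(default_factory=list)


class SceneMemory:
    """Toy scene memory: stores trajectories per scene and re-distills a skill on every update."""

    def __init__(self):
        self._by_scene = defaultdict(list)
        self.skills = {}

    def record(self, traj):
        # Store the trajectory, then refresh the skill for its scene so that
        # memory and skill library evolve together.
        self._by_scene[traj.scene].append(traj)
        self.skills[traj.scene] = self._distill(traj.scene)

    def _distill(self, scene):
        trajs = self._by_scene[scene]
        # Assumed distillation rule: keep the most frequent successful tool sequence.
        wins = Counter(t.tools for t in trajs if t.success)
        best = wins.most_common(1)[0][0] if wins else ()
        # Failed trajectories are attached as de-duplicated lessons.
        failed = (" -> ".join(t.tools) for t in trajs if not t.success)
        lessons = ["avoid: " + seq for seq in dict.fromkeys(failed)]
        return Skill(scene, best, lessons)

    def skill_for(self, scene):
        # Returns the skill to inject when a similar scene recurs, or None if unseen.
        return self.skills.get(scene)


mem = SceneMemory()
mem.record(Trajectory("indoor_distance", ("depth", "measure"), True))
mem.record(Trajectory("indoor_distance", ("caption",), False))
mem.record(Trajectory("indoor_distance", ("depth", "measure"), True))
skill = mem.skill_for("indoor_distance")
```

In this sketch, a skill retrieved with `skill_for` would be placed in the agent's prompt before it acts, and the resulting trajectory would be passed back to `record`, closing the loop. A real system would presumably match scenes by similarity rather than exact label and distill skills with a model rather than a frequency count.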