Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Abstract

This paper explores agentic 3D spatial understanding, in which multimodal large language model (MLLM) agents perform 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences in 3D scenarios, so the agentic paradigm yields only marginal gains over non-agentic strategies. We show that 3D spatial reasoning tasks are heterogeneous across scenes, yet these agents apply a uniform tool-use strategy to every scene rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory in a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, and failed trajectories are attached to the skill as lessons. During training, when a similar scene recurs, the corresponding skill is injected to guide the agent. The resulting trajectories, through their successes and failures, further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training on skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
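
The memory-and-skill loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Skill`, `SceneMemory`, `record`, `distill`, and `run_episode` are hypothetical, scenes are matched by an exact string label rather than by a learned similarity measure, and "distillation" is replaced by a simple majority vote over successful tool sequences.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical scene-aware skill: a suggested tool sequence plus lessons."""
    scene: str
    tools: tuple = ()
    lessons: list = field(default_factory=list)


class SceneMemory:
    """Toy Scene Memory: tool-use trajectories grouped by scene label (assumption)."""

    def __init__(self):
        self.successes = {}  # scene label -> list of successful tool sequences
        self.failures = {}   # scene label -> list of failed tool sequences

    def record(self, scene, tools, success):
        """Store one trajectory under its scene, split by outcome."""
        bucket = self.successes if success else self.failures
        bucket.setdefault(scene, []).append(tuple(tools))

    def distill(self, scene):
        """Build a Skill for `scene` from everything recorded so far."""
        skill = Skill(scene=scene)
        wins = self.successes.get(scene, [])
        if wins:
            # Stand-in for distillation: keep the most frequent successful sequence.
            skill.tools = Counter(wins).most_common(1)[0][0]
        # Each distinct failed sequence is attached to the skill as a lesson.
        for failed in dict.fromkeys(self.failures.get(scene, [])):
            skill.lessons.append("avoid: " + " -> ".join(failed))
        return skill


def run_episode(memory, scene, agent):
    """One loop iteration: inject the current skill, act, record, re-distill.

    `agent` is a hypothetical callable (scene, skill) -> (tools, success).
    """
    skill = memory.distill(scene)          # empty skill if the scene is unseen
    tools, success = agent(scene, skill)   # the agent may follow or ignore the skill
    memory.record(scene, tools, success)   # the new trajectory refines the memory
    return memory.distill(scene)           # the skill evolves with the memory
```

Calling `run_episode` repeatedly on the same scene label updates both the stored trajectories and the distilled skill, which mirrors the co-evolution described in the abstract in a deliberately simplified form. The tool names used with this sketch are placeholders chosen by the caller.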