S-Agent: Spatial Tool Use Elicits Reasoning for Spatial Intelligence

Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce \textsc{S-Agent}, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
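
The evidence-accumulation idea described above can be illustrated with a minimal sketch. It is not the S-Agent implementation: every name here (`Detection2D`, `lift_to_3d`, `SceneMemory`, `run_demo`) is hypothetical, the VLM planner is replaced by a fixed loop, the 2D grounding results are hard-coded, and the 2D-to-3D lifting is assumed to be a standard pinhole back-projection with an unrotated camera. The Agent Memory is reduced to a plain list of step notes.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Detection2D:
    """Hypothetical output of a 2D grounding tool: a labelled pixel with depth."""
    label: str
    u: float      # pixel column of the object centre
    v: float      # pixel row of the object centre
    depth: float  # metric depth along the camera z-axis


def lift_to_3d(det, fx, fy, cx, cy, cam_pos):
    """Back-project a pixel with depth using a pinhole model.

    Assumption: the camera is axis-aligned (no rotation) and only
    translated by cam_pos, which keeps the example short.
    """
    x = (det.u - cx) * det.depth / fx + cam_pos[0]
    y = (det.v - cy) * det.depth / fy + cam_pos[1]
    z = det.depth + cam_pos[2]
    return (x, y, z)


@dataclass
class SceneMemory:
    """Keeps one 3D point per physical object across frames."""
    merge_radius: float = 0.3  # assumed threshold, in metres
    objects: list = field(default_factory=list)  # entries: (label, (x, y, z))

    def add(self, label, point):
        # Skip the observation if the same object was already recorded nearby.
        for seen_label, seen_point in self.objects:
            if seen_label == label and math.dist(seen_point, point) < self.merge_radius:
                return
        self.objects.append((label, point))

    def count(self, label):
        return sum(1 for seen_label, _ in self.objects if seen_label == label)

    def distance(self, label_a, label_b):
        # Uses the first stored instance of each label; raises if one is missing.
        a = next(p for lbl, p in self.objects if lbl == label_a)
        b = next(p for lbl, p in self.objects if lbl == label_b)
        return math.dist(a, b)


def run_demo():
    intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    # Each frame: (camera position, hard-coded 2D detections).
    frames = [
        ((0.0, 0.0, 0.0), [Detection2D("chair", 320.0, 240.0, 2.0),
                           Detection2D("table", 570.0, 240.0, 2.0)]),
        ((1.0, 0.0, 0.0), [Detection2D("chair", 70.0, 240.0, 2.0),    # first chair again
                           Detection2D("chair", 570.0, 240.0, 2.0)]),  # a second chair
    ]
    scene = SceneMemory()
    agent_memory = []  # stand-in for accumulated reasoning context
    for index, (cam_pos, detections) in enumerate(frames):
        for det in detections:
            scene.add(det.label, lift_to_3d(det, cam_pos=cam_pos, **intrinsics))
        agent_memory.append(f"frame {index}: processed {len(detections)} detections")
    return scene, agent_memory


if __name__ == "__main__":
    scene, agent_memory = run_demo()
    print("chairs:", scene.count("chair"))
    print("chair-table distance:", scene.distance("chair", "table"))
    print(agent_memory)
```

In this toy setup the two frames contain three chair detections in total, but the scene-level count is two, because the chair re-observed from the second camera position lifts to the same 3D point and is merged. This is the difference between frame-level prediction and scene-level accumulation that the abstract emphasizes.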