S-Agent: Spatial Tool Use Elicits Reasoning for Spatial Intelligence

Abstract

Real-world spatial intelligence requires reasoning over a continuous, evolving 3D world, yet existing vision-language models (VLMs) and tool-augmented agents remain largely tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent shifts spatial perception from frame-centric recognition to scene-centric understanding. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). In addition, a temporal memory mechanism, comprising Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-300K, a dataset of S-Agent-generated spatial trajectories, yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
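
To make the idea of scene-centric evidence accumulation concrete, the following is a minimal Python sketch of a counting query answered from a scene-level memory rather than from per-frame predictions. It is an illustration under stated assumptions, not the S-Agent implementation: the names `Detection2D`, `Frame`, `lift_to_3d`, `SceneMemory`, and `answer_count` are hypothetical, the 2D detections stand in for the output of a grounding tool, the camera model is a simple pinhole with made-up intrinsics and translation-only poses, and the VLM planner is replaced by a fixed "count this label" request.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Detection2D:
    """Hypothetical output of a 2D grounding tool for one object."""
    label: str
    u: float      # pixel column of the object center
    v: float      # pixel row of the object center
    depth: float  # metric depth at (u, v)


@dataclass
class Frame:
    detections: list   # list of Detection2D for this frame
    cam_offset: tuple  # camera position in the world frame (rotation ignored)


def lift_to_3d(det, cam_offset, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project a pixel with depth (pinhole model), then shift to world frame.

    The intrinsics are illustrative defaults, not values from the paper.
    """
    x = (det.u - cx) * det.depth / fx
    y = (det.v - cy) * det.depth / fy
    z = det.depth
    return (x + cam_offset[0], y + cam_offset[1], z + cam_offset[2])


@dataclass
class SceneMemory:
    """Keeps one 3D point per distinct object seen so far."""
    merge_radius: float = 0.5  # assumed threshold for "same object", in meters
    objects: list = field(default_factory=list)  # (label, (x, y, z)) pairs

    def add(self, label, point):
        for known_label, known_point in self.objects:
            if known_label == label and math.dist(known_point, point) < self.merge_radius:
                return  # re-observation of an object already in memory
        self.objects.append((label, point))

    def count(self, label):
        return sum(1 for known_label, _ in self.objects if known_label == label)


def answer_count(frames, label):
    """Accumulate evidence over all frames, then answer from scene memory."""
    scene = SceneMemory()
    agent_memory = []  # running log of intermediate reasoning context
    for i, frame in enumerate(frames):
        for det in frame.detections:
            if det.label == label:
                scene.add(det.label, lift_to_3d(det, frame.cam_offset))
        agent_memory.append(f"frame {i}: {scene.count(label)} distinct '{label}' so far")
    return scene.count(label), agent_memory


# Two frames: the first chair is visible in both, a second chair only in frame 1.
frames = [
    Frame([Detection2D("chair", 320.0, 240.0, 2.0)], (0.0, 0.0, 0.0)),
    Frame([Detection2D("chair", 70.0, 240.0, 2.0),
           Detection2D("chair", 320.0, 240.0, 2.0)], (1.0, 0.0, 0.0)),
]
total, log = answer_count(frames, "chair")
print(total)
print(log)
```

In this toy scene, summing per-frame detections would report three chairs, whereas merging the lifted 3D points in scene memory reports two, because the chair seen in both frames maps to the same world position.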