S-Agent: Spatial Tool Use Elicits Reasoning for Spatial Intelligence

Abstract

Real-world spatial intelligence requires reasoning over a continuous, evolving 3D world, yet existing VLMs and tool-augmented agents remain largely confined to static, stateless inference from isolated visual observations. We introduce S-Agent, an agentic spatial tool-use paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent shifts spatial perception from frame-centric recognition to scene-centric understanding. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). A temporal memory mechanism, comprising Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs without any training. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-300K, a collection of S-Agent-generated spatial trajectories, yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
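
To make the idea of spatio-temporal evidence accumulation concrete, the following is a minimal Python sketch of a counting query, not the authors' implementation. Every name in it (`Detection2D`, `Frame`, `lift_to_3d`, `SceneMemory`, `AgentMemory`, `plan`, `answer_count`) is hypothetical, as are the pinhole intrinsics, the 0.3 m merge radius, and the simplification that camera poses are pure translations. The keyword-matching `plan` function merely stands in for the VLM planner, and the hand-written detections stand in for the output of a 2D grounding tool and a depth expert.

```python
import math
from dataclasses import dataclass, field

# Assumed pinhole intrinsics (hypothetical values, not from the paper).
FX = FY = 500.0
CX, CY = 320.0, 240.0
KNOWN_LABELS = ("chair", "table")


@dataclass
class Detection2D:
    label: str    # object category from a 2D grounding tool
    u: float      # pixel column of the box centre
    v: float      # pixel row of the box centre
    depth: float  # metres along the optical axis


@dataclass
class Frame:
    cam_pos: tuple    # camera position in world coordinates (translation only)
    detections: list  # Detection2D instances observed in this frame


def lift_to_3d(det, cam_pos):
    """Back-project a 2D detection into world coordinates (no rotation assumed)."""
    x = (det.u - CX) * det.depth / FX
    y = (det.v - CY) * det.depth / FY
    return (cam_pos[0] + x, cam_pos[1] + y, cam_pos[2] + det.depth)


@dataclass
class SceneMemory:
    """Persistent scene state: one 3D point per distinct object instance."""
    merge_radius: float = 0.3
    objects: dict = field(default_factory=dict)

    def add(self, label, point):
        known = self.objects.setdefault(label, [])
        # Treat a nearby point with the same label as a re-observation.
        if any(math.dist(point, p) < self.merge_radius for p in known):
            return False
        known.append(point)
        return True

    def count(self, label):
        return len(self.objects.get(label, []))


@dataclass
class AgentMemory:
    """Running log of reasoning steps taken so far."""
    steps: list = field(default_factory=list)

    def log(self, note):
        self.steps.append(note)


def plan(question):
    """Stand-in for the VLM planner: pick which object category to gather evidence on."""
    return next(label for label in KNOWN_LABELS if label in question.lower())


def answer_count(question, frames):
    label = plan(question)
    scene, agent = SceneMemory(), AgentMemory()
    for i, frame in enumerate(frames):
        for det in frame.detections:
            if det.label != label:
                continue
            point = lift_to_3d(det, frame.cam_pos)
            status = "new" if scene.add(label, point) else "seen before"
            agent.log(f"frame {i}: {label} at {point} ({status})")
    return scene.count(label), agent.steps


# Two overlapping views: the right-hand chair in frame 0 reappears centred in frame 1.
FRAMES = [
    Frame((0.0, 0.0, 0.0), [Detection2D("chair", 320, 240, 2.0),
                            Detection2D("chair", 570, 240, 2.0)]),
    Frame((1.0, 0.0, 0.0), [Detection2D("chair", 320, 240, 2.0),
                            Detection2D("chair", 570, 240, 2.0)]),
]
count, steps = answer_count("How many chairs are in the room?", FRAMES)
```

In this toy scene each frame shows two chairs, so summing per-frame detections gives four and taking any single frame gives two, whereas the scene-level memory merges the re-observed chair and reports three. A real system would also need camera rotation, noisy depth, and a more robust association rule than a fixed distance threshold.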