SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by equipping VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface, which often makes it harder to compose operations freely or to tailor the analysis to each task. Both designs therefore offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives, and a VLM-backed agent writes one executable cell per step, conditioned on all prior outputs. This lets the agent flexibly compose and manipulate perception results and adapt its analysis both to intermediate textual and visual observations and to the demands of each problem. Evaluated on 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by 11.2 points, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.
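
To make the "one executable cell per step" idea concrete, the following is a minimal sketch of a stateful cell-execution loop. It is not the SpatialClaw implementation: `StatefulKernel`, `centroid_distance`, `scripted_agent`, and `run_episode` are hypothetical names introduced here for illustration, the "kernel" is a plain in-process namespace rather than a real Python kernel, and the VLM is replaced by a scripted stub that returns pre-written cells.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: variables survive across cells."""

    def __init__(self, preloaded=None):
        # One namespace shared by every cell, pre-loaded with inputs and helpers.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its captured stdout (or the traceback)."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Errors become observations too, so the agent can react to them.
                print(traceback.format_exc())
        return buffer.getvalue()


def centroid_distance(box_a, box_b):
    """Hypothetical geometry primitive: distance between the centres of two
    (x_min, y_min, x_max, y_max) boxes."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def scripted_agent(history):
    """Placeholder for the VLM: picks the next cell given all prior (cell, output) pairs.
    A real agent would generate the cell from the history instead of indexing a list."""
    cells = [
        "boxes = {'chair': (0, 0, 2, 2), 'table': (3, 4, 5, 6)}\nprint(sorted(boxes))",
        "d = centroid_distance(boxes['chair'], boxes['table'])\nprint(d)",
    ]
    return cells[len(history)] if len(history) < len(cells) else None


def run_episode(agent, kernel, max_steps=8):
    """Alternate between asking the agent for a cell and executing it in the kernel."""
    history = []  # list of (cell, output) pairs fed back to the agent
    for _ in range(max_steps):
        cell = agent(history)
        if cell is None:
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


kernel = StatefulKernel({"centroid_distance": centroid_distance})
history = run_episode(scripted_agent, kernel)
for cell, output in history:
    print(output.strip())
```

The second cell reuses `boxes` defined by the first, which is the property that separates this loop from single-pass execution: each step can be written after the previous output has been observed, and a failing cell returns its traceback as an observation instead of ending the episode.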