SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate to one another, and how they move in 3D space, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by equipping VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface, which often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs constrain open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives. A VLM-backed agent writes one executable cell per step, conditioned on all prior outputs, so it can flexibly compose and manipulate perception results and adapt its analysis to intermediate textual and visual observations as well as to the demands of each problem. Evaluated on 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming a recent spatial agent by 11.2 percentage points, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.
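
To make the "one executable cell per step over a stateful kernel" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain `exec`-based namespace in place of a real Python kernel, a scripted policy in place of a VLM, and a single toy geometry primitive. The names `StatefulKernel`, `run_agent`, `scripted_policy`, and `distance`, as well as the point data, are hypothetical and chosen only for illustration.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: variables survive across cells."""

    def __init__(self, preloaded=None):
        # Shared namespace; `preloaded` plays the role of input frames and primitives.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its captured stdout (or the traceback text)."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception:
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


def run_agent(kernel, policy, max_steps=8):
    """Alternate between the policy writing a cell and the kernel executing it."""
    history = []  # list of (cell_source, observation) pairs
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:  # the policy signals that it is done
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


def scripted_policy(history):
    """Hypothetical policy; a real agent would query a VLM with the full history."""
    if not history:
        # Step 1: measure something and print it so it becomes an observation.
        return "d = distance(points['chair'], points['table'])\nprint(round(d, 2))"
    if len(history) == 1:
        # Step 2: choose the next cell from the previous cell's printed output.
        observed = float(history[0][1])
        if observed < 2.0:
            # `d` is still defined in the kernel from step 1.
            return "answer = 'near' if d < 2.0 else 'far'\nprint(answer)"
        return "print('far')"
    return None


def distance(a, b):
    """Euclidean distance between two 3D points (illustrative geometry primitive)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


kernel = StatefulKernel({
    "distance": distance,
    "points": {"chair": (0.0, 0.0, 0.0), "table": (1.0, 1.0, 1.0)},
})
history = run_agent(kernel, scripted_policy)
for cell, observation in history:
    print(observation.strip())
```

In this sketch the second cell is written only after the first cell's output is available, and it reuses the variable `d` left in the namespace. A single-pass design would have to commit to both steps before seeing any intermediate output.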