SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes an agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface, which often makes it harder to compose operations freely or tailor the analysis to each task. Both designs therefore offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives, and a VLM-backed agent writes one executable cell per step, conditioned on all prior outputs. This lets the agent flexibly compose and manipulate perception results and adapt its analysis to intermediate textual and visual observations as well as to the demands of each problem. Evaluated on 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by 11.2 points, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.
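
To make the interface concrete, the following is a minimal sketch of a stateful kernel driven one cell per step, under stated assumptions. It is not SpatialClaw's implementation: `StatefulKernel`, `run_agent`, `scripted_policy`, and the `distance` primitive are hypothetical names introduced here for illustration, the scripted policy stands in for the VLM, and the object coordinates are made-up toy values.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Runs code cells in one persistent namespace, like a notebook kernel."""

    def __init__(self, preloaded=None):
        # Hypothetical stand-in for pre-loaded frames and primitives.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its printed output (or its traceback)."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception:
            # Errors become observations too, so the next cell can react to them.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


def distance(p, q):
    """Hypothetical geometry primitive: Euclidean distance between 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def scripted_policy(history):
    """Stand-in for the VLM: picks the next cell from all prior (cell, output) pairs."""
    if not history:
        # Toy values; a real agent would obtain these from perception tools.
        return (
            "centers = {'chair': (0.0, 0.0, 2.0), 'table': (3.0, 0.0, 6.0)}\n"
            "print(sorted(centers))"
        )
    if len(history) == 1:
        # Reuses `centers`, which persists in the kernel from the first cell.
        return "d = distance(centers['chair'], centers['table'])\nprint(d)"
    return None  # No further cell: the agent is done.


def run_agent(kernel, policy, max_steps=8):
    """Loop: ask the policy for one cell, execute it, append the observation."""
    history = []
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


kernel = StatefulKernel(preloaded={"distance": distance})
history = run_agent(kernel, scripted_policy)
```

The second cell depends on a variable defined by the first, which is the property that separates this loop from single-pass execution: each cell is chosen only after the previous output has been observed, and state carries over between cells.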