SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface, which often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs thus limit open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives, and a VLM-backed agent writes one executable cell per step, conditioned on all prior outputs. This lets the agent flexibly compose and manipulate perception results and adapt its analysis both to intermediate textual and visual observations and to the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by 11.2 points, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.
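
To make the idea of a stateful kernel with one executable cell per step more concrete, the following is a minimal sketch and not SpatialClaw's actual implementation. The names `StatefulKernel`, `run_cell`, and `distance` are hypothetical, the coordinates are placeholders, and the VLM that would write each cell is replaced by hard-coded strings. It only illustrates how variables defined in one step remain available to later steps, and how each step's output can be fed back as an observation.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a stateful Python kernel (hypothetical, not SpatialClaw's API)."""

    def __init__(self, preloaded=None):
        # One namespace shared by every cell, so variables persist across steps.
        self.namespace = dict(preloaded or {})
        # (code, output) pairs that an agent could condition its next cell on.
        self.history = []

    def run_cell(self, code):
        """Execute one cell and return everything it printed, or its traceback."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Errors become observations instead of ending the episode.
            buffer.write(traceback.format_exc())
        output = buffer.getvalue()
        self.history.append((code, output))
        return output


def distance(p, q):
    """Hypothetical geometry primitive: Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# Pre-load the kernel with the (assumed) primitive, as the abstract describes
# for perception and geometry tools.
kernel = StatefulKernel(preloaded={"distance": distance})

# Step 1: the agent stores intermediate results (placeholder coordinates).
out1 = kernel.run_cell(
    "chair = (0.0, 0.0, 0.0)\ntable = (3.0, 4.0, 0.0)\nprint('stored')"
)

# Step 2: a later cell reuses the state left behind by step 1.
out2 = kernel.run_cell("print(distance(chair, table))")
```

In a single-pass design, both cells would have to be written before either output is seen. Here the second cell can be chosen after inspecting `out1`, which is the property the abstract attributes to the code-as-action interface.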