Geometrically-Constrained Agent for Spatial Reasoning

Abstract

Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose the Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into a formal, verifiable task constraint that defines the reference frame and the objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy resolves the semantic-to-geometric gap, yielding a robust and verifiable pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves state-of-the-art performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by approximately 27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
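To make the two-stage idea concrete, the following is a minimal toy sketch, not the GCA implementation. It assumes a 2D scene and a single objective; the names `TaskConstraint`, `to_reference_frame`, and `solve` are hypothetical and do not come from the paper. The constraint plays the role of the semantic analyst's output (a reference frame plus an objective), and the solver only performs deterministic geometry inside that frame.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaskConstraint:
    """Hypothetical formal task constraint: a reference frame plus an objective."""
    origin: Tuple[float, float]  # world position of the reference object
    heading_deg: float           # facing direction, counter-clockwise from the +x axis
    objective: str               # only "relative_direction" is handled in this sketch


def to_reference_frame(c: TaskConstraint, point: Tuple[float, float]) -> Tuple[float, float]:
    """Express a world-frame point as (forward, left) in the constraint's frame."""
    dx = point[0] - c.origin[0]
    dy = point[1] - c.origin[1]
    theta = math.radians(c.heading_deg)
    forward = dx * math.cos(theta) + dy * math.sin(theta)
    left = -dx * math.sin(theta) + dy * math.cos(theta)
    return forward, left


def solve(c: TaskConstraint, target: Tuple[float, float]) -> str:
    """Deterministic solver: refuses objectives outside the constraint's scope."""
    if c.objective != "relative_direction":
        raise ValueError(f"unsupported objective: {c.objective}")
    forward, left = to_reference_frame(c, target)
    depth = "front" if forward > 0 else "behind"
    side = "left" if left > 0 else "right"
    return f"{depth}-{side}"
```

For example, with a reference object at the origin facing the +y direction (`heading_deg=90.0`), a target at `(1.0, 1.0)` resolves to `"front-right"`; the same target seen from a reference object facing +x (`heading_deg=0.0`) resolves to `"front-left"`. The answer depends entirely on the reference frame fixed in the constraint, which is the ambiguity the first stage is meant to remove.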
