Geometrically-Constrained Agent for Spatial Reasoning

Abstract

Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose the Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into a formal, verifiable task constraint, which defines the reference frame and the objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by that constraint. This geometrically-constrained reasoning strategy resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves state-of-the-art (SOTA) performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by approximately 27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
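
The two-stage split described in the abstract can be illustrated with a minimal sketch. This is not the GCA implementation: every name here (`TaskConstraint`, `ToolCall`, `check_call`, `run_solver`, and the tool names) is hypothetical, the `allowed_tools` field is an added assumption (the abstract says only that the constraint defines a reference frame and an objective), and the VLM is replaced by hand-written inputs. The sketch shows only the general idea of checking each proposed tool call against a formal constraint before it is executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConstraint:
    """Hypothetical formal constraint produced by the semantic-analyst stage."""
    reference_frame: str      # e.g. "camera" or "object:chair"
    objective: str            # e.g. "relative_direction"
    allowed_tools: frozenset  # assumption: tools the solver may call


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical tool call proposed by the task-solver stage."""
    tool: str
    frame: str


def check_call(constraint: TaskConstraint, call: ToolCall) -> bool:
    """Accept a call only if its tool is allowed and its frame matches."""
    return (call.tool in constraint.allowed_tools
            and call.frame == constraint.reference_frame)


def run_solver(constraint, proposed_calls):
    """Split proposed calls into accepted and rejected lists, keeping order."""
    accepted, rejected = [], []
    for call in proposed_calls:
        (accepted if check_call(constraint, call) else rejected).append(call)
    return accepted, rejected


# Stage one (stubbed): the constraint a semantic analyst might emit.
constraint = TaskConstraint(
    reference_frame="object:chair",
    objective="relative_direction",
    allowed_tools=frozenset({"estimate_pose", "compute_direction"}),
)

# Stage two (stubbed): calls a solver might propose.
proposed = [
    ToolCall("estimate_pose", "object:chair"),
    ToolCall("compute_direction", "camera"),    # wrong reference frame
    ToolCall("measure_depth", "object:chair"),  # tool not in allowed set
]
accepted, rejected = run_solver(constraint, proposed)
```

In this toy setup, only the first proposed call satisfies the constraint. The other two are rejected before any computation runs, which is the kind of planning-stage check that the abstract says tool-integrated methods lack.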