AlloSpatial: An Agentic Harness Framework for Spatial Reasoning in Foundation Models

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet they remain fragile in spatial reasoning over the physical world. A key bottleneck is their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support queries over object topology, geometric relations, passability, and trajectories. To use these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness that handles tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5% to 18% in a training-free setting, and that ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
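To make the egocentric-to-allocentric conversion concrete, the following is a minimal 2D sketch of the underlying geometry only. It is not the World2Mind implementation, which the abstract does not specify. The names `Pose2D`, `ego_to_allo`, and `allo_distance` are hypothetical, and the sketch assumes a known observer pose, a flat ground plane, and noise-free measurements.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Hypothetical observer pose in the world (allocentric) frame."""
    x: float
    y: float
    yaw: float  # heading in radians, counter-clockwise from the world +x axis


def ego_to_allo(pose: Pose2D, forward: float, left: float) -> tuple[float, float]:
    """Map a point seen at (forward, left) in the observer frame to world (x, y).

    This is a plain 2D rigid transform: rotate by the observer's yaw,
    then translate by the observer's position.
    """
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    return (pose.x + c * forward - s * left,
            pose.y + s * forward + c * left)


def allo_distance(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Euclidean distance between two points already expressed in the world frame."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


# Two objects observed from two different viewpoints (assumed poses).
view_a = Pose2D(x=0.0, y=0.0, yaw=0.0)          # facing world +x
view_b = Pose2D(x=4.0, y=0.0, yaw=math.pi / 2)  # facing world +y

chair = ego_to_allo(view_a, forward=2.0, left=0.0)
table = ego_to_allo(view_b, forward=3.0, left=0.0)

print("chair (world):", chair)
print("table (world):", table)
print("chair-table distance:", allo_distance(chair, table))
```

Once both objects are expressed in the same world frame, a relation such as their distance can be computed even though no single view contained both, which is the kind of query a global allocentric representation is meant to support.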