AlloSpatial: An Agentic Harness Framework for Spatial Reasoning in Foundation Models

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet they remain fragile in spatial reasoning over the physical world. A key bottleneck is their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors. These priors include Allocentric-Spatial Trees (ASTs) and route maps, which support queries over object topology, geometric relations, passability, and trajectories. To use these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated, trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, and that ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
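
To make the egocentric-to-allocentric step concrete, the following is a minimal illustrative sketch of the standard 2D rigid transform that maps a point observed in a camera-centered frame into a shared world frame. It is not the World2Mind implementation, which the abstract does not specify. The function name `ego_to_allo`, the axis convention (x forward, y left), and the assumption that the camera pose is known are all hypothetical choices made for illustration.

```python
import math


def ego_to_allo(point_ego, cam_xy, cam_yaw):
    """Map a 2D point from an egocentric (camera) frame to an allocentric (world) frame.

    Assumed convention (hypothetical, for illustration only):
      - egocentric x points forward from the camera, y points to its left
      - cam_xy is the camera position in world coordinates
      - cam_yaw is the camera heading in radians, counterclockwise from the world x-axis
    """
    x, y = point_ego
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    # Rotate the observation by the camera heading, then translate by the camera position.
    return (cam_xy[0] + c * x - s * y, cam_xy[1] + s * x + c * y)


# Two views of the same object from different poses should agree in the world frame.
view_a = ego_to_allo((2.0, 1.0), cam_xy=(0.0, 0.0), cam_yaw=0.0)
view_b = ego_to_allo((2.0, 0.0), cam_xy=(4.0, 1.0), cam_yaw=math.pi)
```

Under these assumptions, both views place the object at approximately (2, 1) in the world frame, which is the property that lets observations from multiple viewpoints be merged into a single global map.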