AlloSpatial: An Agentic Framework for Spatial Reasoning in Foundation Models

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet they remain fragile at spatial reasoning over the physical world. A key bottleneck is their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors. These priors include Allocentric-Spatial Trees (ASTs) and route maps, which support queries about object topology, geometric relations, passability, and trajectories. To use these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness that handles tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated, trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5% to 18% in a training-free setting, and that ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
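The egocentric-to-allocentric conversion that the abstract identifies as the bottleneck can be illustrated with a plain 2D rigid transform. The sketch below is only an illustration under simplifying assumptions (a flat floor plan, known camera poses); it is not the World2Mind implementation, and the names `Pose2D` and `ego_to_allo` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose2D:
    """Hypothetical camera pose in a shared world frame."""
    x: float
    y: float
    heading: float  # radians, counter-clockwise from the world +x axis


def ego_to_allo(pose: Pose2D, forward: float, left: float) -> tuple[float, float]:
    """Map an egocentric offset (forward, left) to world coordinates.

    This is a standard 2D rotation followed by a translation; the paper's
    actual mapping pipeline is not specified in the abstract.
    """
    c, s = math.cos(pose.heading), math.sin(pose.heading)
    return (pose.x + c * forward - s * left,
            pose.y + s * forward + c * left)


# The same object seen from two different camera poses should resolve to
# one allocentric location once each view is transformed.
view_a = ego_to_allo(Pose2D(0.0, 0.0, 0.0), forward=2.0, left=1.0)
view_b = ego_to_allo(Pose2D(2.0, 3.0, -math.pi / 2), forward=2.0, left=0.0)
```

Both views resolve to approximately the same world point, which is the property an allocentric representation relies on: object positions no longer depend on where the observer was standing.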