AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

June 8, 2026
Authors: Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei
cs.AI

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck is their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support queries over object topology, geometric relations, passability, and trajectories. To use these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5% to 18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
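
To make the egocentric-to-allocentric step concrete, the following is a minimal 2D sketch of the underlying geometry, not the World2Mind implementation. It assumes the observer pose is already known in a shared world frame and that each object is reported as a (forward, left) offset from the observer. The names `Pose2D`, `ego_to_allo`, and `allo_relation` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose2D:
    """Hypothetical observer pose in the shared (allocentric) world frame."""
    x: float
    y: float
    heading: float  # radians, counterclockwise from the world +x axis


def ego_to_allo(pose: Pose2D, forward: float, left: float) -> tuple[float, float]:
    """Rotate an observer-relative offset by the heading, then translate by the pose."""
    c, s = math.cos(pose.heading), math.sin(pose.heading)
    return (pose.x + c * forward - s * left,
            pose.y + s * forward + c * left)


def allo_relation(a: tuple[float, float], b: tuple[float, float]) -> tuple[float, float]:
    """Distance and world-frame bearing (degrees) from point a to point b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


# Two views taken from different poses, each seeing a different object.
chair = ego_to_allo(Pose2D(0.0, 0.0, 0.0), forward=2.0, left=0.0)
table = ego_to_allo(Pose2D(2.0, 1.0, math.pi / 2), forward=3.0, left=1.0)

# Once both objects live in one frame, their relation no longer depends on any view.
distance, bearing = allo_relation(chair, table)
print(chair, table, round(distance, 3), round(bearing, 1))
```

Once every observation is expressed in one world frame, relations between objects that never appeared in the same view can be computed directly, which is the kind of query the abstract attributes to the allocentric priors. A real system would also have to estimate the poses themselves and cope with reconstruction noise, which this sketch ignores.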