SpatialTree: How Spatial Abilities Branch Out in MLLMs
December 23, 2025
Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
cs.AI
Abstract
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark and use it to thoroughly evaluate mainstream MLLMs across 27 sub-abilities. The results reveal a clear structure: L1 skills are largely orthogonal to one another, whereas higher-level skills are strongly correlated, indicating that interdependency increases up the hierarchy. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive reinforcement learning (RL) that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We therefore propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
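The orthogonality claim above can be made concrete: across a set of evaluated models, per-sub-ability scores form a models-by-skills matrix, and pairwise Pearson correlations over its columns reveal whether two skills vary together. The sketch below illustrates this analysis on synthetic data; all skill names and score values are hypothetical, not drawn from the SpatialTree benchmark itself.

```python
import numpy as np

# Hypothetical scores for 50 models on four sub-abilities (illustrative only).
# Rows = models, columns = sub-abilities.
rng = np.random.default_rng(0)
n_models = 50

# Two L1 perception skills drawn independently -> expected near-orthogonal.
l1_depth = rng.normal(0.60, 0.10, n_models)
l1_size = rng.normal(0.55, 0.10, n_models)

# Two higher-level skills driven by a shared latent factor -> expected correlated.
latent = rng.normal(0.50, 0.10, n_models)
l3_sim = latent + rng.normal(0.0, 0.02, n_models)
l4_agent = latent + rng.normal(0.0, 0.02, n_models)

# Stack into a (models x skills) score matrix and correlate the columns.
scores = np.column_stack([l1_depth, l1_size, l3_sim, l4_agent])
corr = np.corrcoef(scores, rowvar=False)  # 4x4 Pearson correlation matrix

print("L1 vs L1 correlation:", round(corr[0, 1], 2))  # small in magnitude
print("L3 vs L4 correlation:", round(corr[2, 3], 2))  # close to 1
```

The same recipe applied to real benchmark scores would distinguish a block of weakly correlated L1 entries from a strongly correlated high-level block, matching the structure the abstract reports.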