Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

November 3, 2025
Authors: Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
cs.AI

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. To address this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational spatial knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, yielding significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning with the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method that learns viewpoint representations while maintaining coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
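The abstract's RL stage relies on GRPO, whose defining step is a critic-free, group-relative advantage: for each question, several answers are sampled, and each answer's reward is standardized against the other answers in the same group. Below is a minimal PyTorch sketch of that step only; the binary reward and all names are illustrative assumptions, not the paper's actual reward design or code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages, the core of GRPO.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward
    per sampled response. Each response's advantage is its reward
    standardized within its own group, which removes the need for a
    learned value function (critic).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical example: 2 viewpoint questions, 4 sampled answers each,
# scored 1.0 if correct and 0.0 otherwise (a common choice for
# verifiable QA; the paper's reward is not specified here).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)
# Answers above their group mean get positive advantages and are
# reinforced by the policy-gradient update; the rest are suppressed.
```

Because the advantage is computed purely from within-group statistics, this stage only needs a verifiable per-answer reward, which is why it can be applied to a broader set of questions than the SFT stage.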