SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
November 10, 2025
作者: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
cs.AI
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, then reasoning toward an answer guided by dense spatial rewards. SpatialThinker rests on two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward that enforces spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the gain over the base model achieved by sparse RL, and surpasses GPT-4o. These results demonstrate that combining spatial supervision with reward-aligned reasoning enables robust 3D spatial understanding from limited data and advances MLLMs toward human-level visual reasoning.
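For concreteness, the sketch below illustrates one plausible form a "multi-objective dense spatial reward" could take: a weighted sum of answer correctness, scene-graph grounding measured by bounding-box IoU, and adherence to a structured output format. The function names, weights, and the IoU-based grounding term are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-objective dense spatial reward (not the authors' code).
# Assumes the policy emits an answer plus a scene graph with predicted bounding boxes;
# the weights and the IoU-based grounding term are illustrative assumptions.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(
    pred_answer: str,
    gold_answer: str,
    pred_boxes: Dict[str, Box],   # object name -> box predicted in the scene graph
    gold_boxes: Dict[str, Box],   # object name -> annotated reference box
    format_ok: bool,              # did the output follow the structured reasoning template?
    w_acc: float = 1.0,
    w_ground: float = 0.5,
    w_format: float = 0.2,
) -> float:
    # Sparse term: exact-match answer correctness.
    r_acc = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    # Dense term: average IoU over task-relevant objects; missing objects score zero.
    overlaps = [iou(pred_boxes[name], box)
                for name, box in gold_boxes.items() if name in pred_boxes]
    r_ground = sum(overlaps) / len(gold_boxes) if gold_boxes else 0.0
    # Format term: reward adherence to the scene-graph + reasoning output structure.
    r_format = 1.0 if format_ok else 0.0
    return w_acc * r_acc + w_ground * r_ground + w_format * r_format
```

Unlike a sparse correctness-only reward, a grounding term of this kind gives partial credit whenever the model localizes the task-relevant objects, which is the intuition behind "dense" spatial supervision during online RL.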