
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

November 10, 2025
作者: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
cs.AI

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
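
To make the "multi-objective dense spatial reward" concrete, below is a minimal illustrative sketch of how such a reward could combine a sparse answer-correctness term with dense spatial-grounding and format terms. The function names, the 0.6/0.3/0.1 weights, the IoU-based grounding score, and the <think>/<answer> format check are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a multi-objective dense spatial reward for RL fine-tuning.
# Component names, weights, and the box-matching scheme are illustrative assumptions.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in normalized coordinates


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(
    pred_answer: str,
    gold_answer: str,
    pred_boxes: Dict[str, Box],
    gold_boxes: Dict[str, Box],
    has_valid_format: bool,
    weights: Dict[str, float] = None,
) -> float:
    """Combine a sparse answer reward with dense grounding and format terms."""
    w = weights or {"answer": 0.6, "grounding": 0.3, "format": 0.1}

    # Sparse term: exact-match correctness of the final answer.
    r_answer = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0

    # Dense term: mean IoU over grounded objects that also appear in the
    # reference scene graph; missing objects pull the score toward zero.
    shared = set(pred_boxes) & set(gold_boxes)
    r_ground = (
        sum(box_iou(pred_boxes[k], gold_boxes[k]) for k in shared) / len(gold_boxes)
        if gold_boxes else 0.0
    )

    # Dense term: did the rollout follow the expected reasoning/answer structure?
    r_format = 1.0 if has_valid_format else 0.0

    return w["answer"] * r_answer + w["grounding"] * r_ground + w["format"] * r_format


if __name__ == "__main__":
    reward = spatial_reward(
        pred_answer="left of the chair",
        gold_answer="left of the chair",
        pred_boxes={"chair": (0.40, 0.30, 0.70, 0.90)},
        gold_boxes={"chair": (0.42, 0.28, 0.72, 0.92), "lamp": (0.05, 0.10, 0.20, 0.60)},
        has_valid_format=True,
    )
    print(f"reward = {reward:.3f}")
```

Relative to a sparse correctness-only reward, the dense grounding term gives partial credit whenever the model localizes the task-relevant objects, which is the kind of denser supervision signal the abstract credits for nearly doubling the base-model gain.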