
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

May 21, 2025
Authors: Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task that requires identifying object transformations across images under varying viewpoints. Traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, while sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects to improve spatial reasoning. Our work provides critical insights for advancing research on MLLMs and reasoning models. The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
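The abstract does not spell out the reward formula, so the following is only a minimal Python sketch of what a fine-grained reward of this kind could look like. Everything in it is an assumption made for illustration: the `(object_id, attribute, new_value)` representation of a transformation, the function names, the partial-credit weights (1.0, 0.5, 0.2), and the two penalty values. It is not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical representation of one transformation:
# (object_id, attribute, new_value), e.g. (1, "color", "red").
Transformation = Tuple[int, str, str]


def partial_credit(pred: Transformation, truth: Transformation) -> float:
    """Score one prediction against one ground-truth transformation.

    The weights are illustrative assumptions, not values from the paper.
    """
    if pred == truth:
        return 1.0  # object, attribute, and value all correct
    if pred[:2] == truth[:2]:
        return 0.5  # correct object and attribute, wrong value
    if pred[0] == truth[0]:
        return 0.2  # correct object only
    return 0.0


def fine_grained_reward(
    preds: List[Transformation],
    truths: List[Transformation],
    extra_penalty: float = 0.5,     # assumed cost per surplus prediction
    inaction_penalty: float = 1.0,  # assumed cost for answering nothing
) -> float:
    """Reward partial correctness; penalize over-enumeration and inaction."""
    # Passive inaction: no answer although something changed.
    if not preds and truths:
        return -inaction_penalty

    remaining = list(truths)
    reward = 0.0
    for pred in preds:
        if not remaining:
            break
        # Greedily match each prediction to its best unmatched truth.
        best = max(remaining, key=lambda t: partial_credit(pred, t))
        score = partial_credit(pred, best)
        if score > 0:
            reward += score
            remaining.remove(best)

    # Excessive enumeration: listing more transformations than occurred.
    surplus = max(0, len(preds) - len(truths))
    return reward - extra_penalty * surplus
```

Under these assumed weights, a policy that guesses the right object and attribute but the wrong value still receives a positive signal, while a policy that lists every possible transformation or stays silent is pushed away from those strategies. A sparse reward, by contrast, would return zero in all three cases.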
