STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task that requires identifying object transformations across images taken from varying viewpoints. Traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, while sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects to improve spatial reasoning. Our work provides critical insights for advancing research on MLLMs and reasoning models. The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
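To make the idea of a fine-grained reward concrete, the following is a minimal illustrative sketch, not the reward used in STAR-R1. It assumes that a transformation can be written as an (object, attribute, value) triple and that each (object, attribute) pair occurs at most once in the ground truth. The names `Transformation` and `fine_grained_reward`, and all numeric reward values, are hypothetical choices made only for this example.

```python
from typing import Iterable, NamedTuple


class Transformation(NamedTuple):
    """Hypothetical encoding of one change: which object, which attribute, new value."""
    obj: int
    attribute: str
    value: str


def fine_grained_reward(
    predicted: Iterable[Transformation],
    target: Iterable[Transformation],
    full: float = 1.0,           # assumed reward for an exact triple match
    partial: float = 0.5,        # assumed reward for right object/attribute, wrong value
    extra_penalty: float = 0.5,  # assumed penalty per unmatched or redundant guess
    empty_penalty: float = 1.0,  # assumed penalty for answering nothing
) -> float:
    predicted, target = set(predicted), set(target)

    # Passive inaction: changes exist but the model predicts none.
    if target and not predicted:
        return -empty_penalty

    target_keys = {(t.obj, t.attribute) for t in target}
    exact = predicted & target
    credited = {(p.obj, p.attribute) for p in exact}
    reward = full * len(exact)

    for p in predicted - exact:
        key = (p.obj, p.attribute)
        if key in target_keys and key not in credited:
            # Partial correctness: credit each (object, attribute) pair at most once.
            credited.add(key)
            reward += partial
        else:
            # Excessive enumeration: every other guess costs reward.
            reward -= extra_penalty
    return reward
```

Under these assumptions, a prediction that gets the object and attribute right but the value wrong still earns some reward, which gives a denser learning signal than an all-or-nothing score. Listing many candidate changes lowers the total, and returning nothing is penalized outright.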

