STAR-R1: Spatial Transformation Reasoning by Reinforcing Multimodal LLMs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task that requires identifying object transformations across images taken from varying viewpoints. Traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, while sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects to improve spatial reasoning. Our work provides critical insights for advancing research on MLLMs and reasoning models. The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
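To make the idea of a fine-grained reward concrete, the following is a minimal sketch of how partial correctness, excessive enumeration, and passive inaction could each enter a scalar reward. It is not the STAR-R1 implementation: the `Transformation` triple, the function name `fine_grained_reward`, and every numeric weight are assumptions chosen for illustration only.

```python
from typing import List, NamedTuple


class Transformation(NamedTuple):
    """Hypothetical representation of one object change (an assumption)."""
    obj: int    # index of the object that changed
    attr: str   # attribute that changed, e.g. "color" or "size"
    value: str  # new value of that attribute


def fine_grained_reward(
    predicted: List[Transformation],
    target: List[Transformation],
    full: float = 1.0,          # object, attribute, and value all correct
    partial_attr: float = 0.5,  # object and attribute correct, value wrong
    partial_obj: float = 0.25,  # only the object correct
    penalty: float = 0.5,       # per surplus or missing prediction
) -> float:
    """Score predictions against ground truth with partial credit.

    All weights are illustrative assumptions, not values from the paper.
    """
    reward = 0.0
    unmatched = list(target)
    for p in predicted:
        best_score, best = 0.0, None
        # Find the best still-unmatched target for this prediction.
        for t in unmatched:
            if p.obj != t.obj:
                continue
            if p.attr == t.attr and p.value == t.value:
                score = full
            elif p.attr == t.attr:
                score = partial_attr
            else:
                score = partial_obj
            if score > best_score:
                best_score, best = score, t
        if best is not None:
            unmatched.remove(best)  # each target can be credited only once
            reward += best_score
    # Excessive enumeration: more predictions than true transformations.
    reward -= penalty * max(0, len(predicted) - len(target))
    # Passive inaction: fewer predictions than true transformations.
    reward -= penalty * max(0, len(target) - len(predicted))
    return reward
```

Under these assumed weights, a policy that lists every conceivable transformation loses reward for each surplus guess, and a policy that outputs nothing is penalized for each transformation it failed to attempt, so neither degenerate strategy dominates a partially correct answer.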