STAR-R1: Spatial Transformation Reasoning by Reinforcing Multimodal LLMs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task that requires identifying object transformations across images taken from varying viewpoints. Traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, while sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects to improve spatial reasoning. Our work provides critical insights for advancing research on MLLMs and reasoning models. The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
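
The abstract describes the reward only qualitatively: partial correctness is rewarded, while excessive enumeration and passive inaction are penalized. As a rough illustration of how such a dense reward could be shaped, here is a minimal Python sketch. It is not the authors' implementation. The function name `dense_reward`, the `(object_id, attribute, new_value)` encoding of a transformation, and all numeric weights are assumptions made for this example.

```python
from typing import Iterable, Tuple

# Hypothetical encoding of one transformation: (object_id, attribute, new_value).
Transformation = Tuple[int, str, str]


def dense_reward(
    predicted: Iterable[Transformation],
    ground_truth: Iterable[Transformation],
    wrong_penalty: float = 0.5,     # assumed weight, not taken from the paper
    inaction_penalty: float = 1.0,  # assumed weight, not taken from the paper
    full_bonus: float = 1.0,        # assumed weight, not taken from the paper
) -> float:
    """Score a predicted set of transformations against the ground truth."""
    pred = set(predicted)
    truth = set(ground_truth)

    # Passive inaction: predicting nothing when something did change.
    if not pred:
        return -inaction_penalty if truth else 0.0

    correct = len(pred & truth)  # partial credit for each correct item
    wrong = len(pred - truth)    # each spurious guess costs something,
                                 # which discourages listing every option
    reward = correct - wrong_penalty * wrong

    # Extra credit when the answer matches the ground truth exactly.
    if pred == truth:
        reward += full_bonus
    return reward


if __name__ == "__main__":
    truth = [(0, "color", "red"), (2, "size", "small")]
    print(dense_reward(truth, truth))      # exact answer
    print(dense_reward(truth[:1], truth))  # partially correct answer
    print(dense_reward([], truth))         # no answer at all
```

Under these assumed weights, an exact answer scores higher than a partially correct one, and a partially correct answer scores higher than either an empty answer or one padded with many spurious guesses. A sparse reward would instead return a nonzero value only for the exact answer.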