RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Abstract

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured, information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both the visual understanding and the reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse-reward problem while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception tasks to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
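The abstract names two design ideas, a difficulty-aware reward with detail rewards and a multi-stage schedule from perception to reasoning, without giving formulas. The sketch below is only an illustration of how such ideas could be combined, not the method used in RewardMap: the `Sample` fields, the `(1 + difficulty)` scaling, the `detail_weight` value, and the two task labels are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical fields; the abstract does not specify a data schema.
    task: str          # "perception" (VQA-style) or "reasoning"
    difficulty: float  # assumed difficulty score in [0, 1]


def shaped_reward(correct: bool, detail_hits: int, detail_total: int,
                  difficulty: float, detail_weight: float = 0.5) -> float:
    """Final-answer reward plus partial credit for correct details,
    scaled up for harder samples (assumed form, not the paper's)."""
    base = 1.0 if correct else 0.0
    # Partial credit keeps the signal non-zero even when the final answer is wrong.
    detail = detail_hits / detail_total if detail_total else 0.0
    return (1.0 + difficulty) * (base + detail_weight * detail)


def stage_order(samples):
    """Order samples so perception tasks come before reasoning tasks,
    and easier samples come before harder ones within each stage."""
    rank = {"perception": 0, "reasoning": 1}
    return sorted(samples, key=lambda s: (rank[s.task], s.difficulty))
```

Under these assumptions, a wrong final answer with half of the details correct still earns a small positive reward, which is the kind of denser signal the abstract attributes to detail rewards.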
