RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
October 2, 2025
Authors: Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang
cs.AI
Abstract
Fine-grained visual reasoning remains a core challenge for multimodal large
language models (MLLMs). The recently introduced ReasonMap highlights this gap
by showing that even advanced MLLMs struggle with spatial reasoning in
structured and information-rich settings such as transit maps, a task of clear
practical and scientific importance. However, standard reinforcement learning
(RL) on such tasks is impeded by sparse rewards and unstable optimization. To
address this, we first construct ReasonMap-Plus, an extended dataset that
introduces dense reward signals through Visual Question Answering (VQA) tasks,
enabling effective cold-start training of fine-grained visual understanding
skills. Next, we propose RewardMap, a multi-stage RL framework designed to
improve both visual understanding and reasoning capabilities of MLLMs.
RewardMap incorporates two key designs. First, we introduce a difficulty-aware
reward design that integrates detail rewards, directly tackling the sparse-reward
problem while providing richer supervision. Second, we propose a multi-stage RL
scheme that bootstraps training from simple perception to complex reasoning
tasks, offering a more effective cold-start strategy than conventional
Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus
demonstrate that each component of RewardMap contributes to consistent
performance gains, while their combination yields the best results. Moreover,
models trained with RewardMap achieve an average improvement of 3.47% across 6
benchmarks spanning spatial reasoning, fine-grained visual reasoning, and
general tasks beyond transit maps, underscoring enhanced visual understanding
and reasoning capabilities.
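
The abstract does not spell out the exact reward formulation, but a minimal sketch of what a difficulty-aware reward combined with dense detail rewards could look like is shown below. The function name, the difficulty scaling, and the detail-matching scheme are illustrative assumptions, not the paper's actual implementation; the point is only that partially correct rollouts receive a non-zero signal instead of a sparse all-or-nothing reward.

```python
# Minimal sketch of a difficulty-aware reward with detail rewards.
# All names (difficulty, detail_weight, the detail items) are illustrative
# assumptions; the abstract does not specify RewardMap's exact formulation.

from typing import List


def difficulty_aware_reward(
    answer_correct: bool,
    predicted_details: List[str],
    reference_details: List[str],
    difficulty: float,          # e.g., normalized to [0, 1] by route length / transfers
    base_reward: float = 1.0,
    detail_weight: float = 0.5,
) -> float:
    """Combine a difficulty-scaled correctness reward with dense detail rewards."""
    # Final-answer term, scaled up for harder questions.
    correctness = base_reward * (1.0 + difficulty) if answer_correct else 0.0

    # Dense partial credit: fraction of reference details (stations, transfers)
    # that appear in the model's rollout.
    if reference_details:
        hits = sum(d in predicted_details for d in reference_details)
        detail = detail_weight * hits / len(reference_details)
    else:
        detail = 0.0

    return correctness + detail


if __name__ == "__main__":
    # A partially correct route still earns a learning signal.
    r = difficulty_aware_reward(
        answer_correct=False,
        predicted_details=["Line 2", "Transfer at Central"],
        reference_details=["Line 2", "Transfer at Central", "Line 7"],
        difficulty=0.8,
    )
    print(f"reward = {r:.3f}")  # ~0.333
```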
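The multi-stage RL scheme can likewise be pictured as a staged curriculum that moves from dense-reward perception questions (ReasonMap-Plus VQA) to full route reasoning (ReasonMap), replacing an SFT cold start. The sketch below is an assumption about how such a schedule might be organized; the stage splits, step counts, and the `rl_update` hook are hypothetical.

```python
# Sketch of a multi-stage RL schedule: bootstrap from perception to reasoning.
# Dataset names come from the paper; stage boundaries, step counts, and the
# `rl_update` trainer hook are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    dataset: str   # which data source the stage samples from
    steps: int     # RL updates before advancing to the next stage


STAGES: List[Stage] = [
    Stage("perception",     "ReasonMap-Plus (counting / true-false VQA)", steps=500),
    Stage("easy-reasoning", "ReasonMap (short routes, few transfers)",    steps=1000),
    Stage("hard-reasoning", "ReasonMap (long routes, many transfers)",    steps=2000),
]


def run_multistage_rl(rl_update: Callable[[str], None]) -> None:
    """Run stages in order; each stage continues from the previous policy,
    serving as the cold start instead of conventional SFT."""
    for stage in STAGES:
        for _ in range(stage.steps):
            rl_update(stage.dataset)   # one policy-gradient update on this stage's data
        print(f"finished stage: {stage.name}")


if __name__ == "__main__":
    run_multistage_rl(lambda dataset: None)   # dummy update for illustration
```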