RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
October 2, 2025
Authors: Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang
cs.AI
Abstract
Fine-grained visual reasoning remains a core challenge for multimodal large
language models (MLLMs). The recently introduced ReasonMap highlights this gap
by showing that even advanced MLLMs struggle with spatial reasoning in
structured and information-rich settings such as transit maps, a task of clear
practical and scientific importance. However, standard reinforcement learning
(RL) on such tasks is impeded by sparse rewards and unstable optimization. To
address this, we first construct ReasonMap-Plus, an extended dataset that
introduces dense reward signals through Visual Question Answering (VQA) tasks,
enabling effective cold-start training of fine-grained visual understanding
skills. Next, we propose RewardMap, a multi-stage RL framework designed to
improve both visual understanding and reasoning capabilities of MLLMs.
RewardMap incorporates two key designs. First, we introduce a difficulty-aware
reward design that integrates detail rewards, directly tackling the sparse-reward
problem while providing richer supervision. Second, we propose a multi-stage RL
scheme that bootstraps training from simple perception to complex reasoning
tasks, offering a more effective cold-start strategy than conventional
Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus
demonstrate that each component of RewardMap contributes to consistent
performance gains, while their combination yields the best results. Moreover,
models trained with RewardMap achieve an average improvement of 3.47% across 6
benchmarks spanning spatial reasoning, fine-grained visual reasoning, and
general tasks beyond transit maps, underscoring enhanced visual understanding
and reasoning capabilities.
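
The abstract does not spell out the exact reward formulation, but a minimal sketch of what a difficulty-aware reward combined with dense detail rewards could look like is shown below. The function name, the difficulty scaling, and the detail-matching scheme are illustrative assumptions, not the paper's actual implementation; the point is only that partially correct rollouts receive a non-zero signal instead of a sparse all-or-nothing reward.

```python
# Minimal sketch of a difficulty-aware reward with detail rewards.
# All names (difficulty, detail_weight, the detail items) are illustrative
# assumptions; the abstract does not specify RewardMap's exact formulation.

from typing import List


def difficulty_aware_reward(
    answer_correct: bool,
    predicted_details: List[str],
    reference_details: List[str],
    difficulty: float,          # e.g., normalized to [0, 1] by route length / transfers
    base_reward: float = 1.0,
    detail_weight: float = 0.5,
) -> float:
    """Combine a difficulty-scaled correctness reward with dense detail rewards."""
    # Final-answer term, scaled up for harder questions.
    correctness = base_reward * (1.0 + difficulty) if answer_correct else 0.0

    # Dense partial credit: fraction of reference details (stations, transfers)
    # that appear in the model's rollout.
    if reference_details:
        hits = sum(d in predicted_details for d in reference_details)
        detail = detail_weight * hits / len(reference_details)
    else:
        detail = 0.0

    return correctness + detail


if __name__ == "__main__":
    # A partially correct route still earns a learning signal.
    r = difficulty_aware_reward(
        answer_correct=False,
        predicted_details=["Line 2", "Transfer at Central"],
        reference_details=["Line 2", "Transfer at Central", "Line 7"],
        difficulty=0.8,
    )
    print(f"reward = {r:.3f}")  # ~0.333
```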
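The multi-stage RL scheme can likewise be pictured as a staged curriculum that moves from dense-reward perception questions (ReasonMap-Plus VQA) to full route reasoning (ReasonMap), replacing an SFT cold start. The sketch below is an assumption about how such a schedule might be organized; the stage splits, step counts, and the `rl_update` hook are hypothetical.

```python
# Sketch of a multi-stage RL schedule: bootstrap from perception to reasoning.
# Dataset names come from the paper; stage boundaries, step counts, and the
# `rl_update` trainer hook are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    dataset: str   # which data source the stage samples from
    steps: int     # RL updates before advancing to the next stage


STAGES: List[Stage] = [
    Stage("perception",     "ReasonMap-Plus (counting / true-false VQA)", steps=500),
    Stage("easy-reasoning", "ReasonMap (short routes, few transfers)",    steps=1000),
    Stage("hard-reasoning", "ReasonMap (long routes, many transfers)",    steps=2000),
]


def run_multistage_rl(rl_update: Callable[[str], None]) -> None:
    """Run stages in order; each stage continues from the previous policy,
    serving as the cold start instead of conventional SFT."""
    for stage in STAGES:
        for _ in range(stage.steps):
            rl_update(stage.dataset)   # one policy-gradient update on this stage's data
        print(f"finished stage: {stage.name}")


if __name__ == "__main__":
    run_multistage_rl(lambda dataset: None)   # dummy update for illustration
```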