RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Abstract

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured, information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both the visual understanding and the reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception tasks to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
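The abstract names two design ideas, a difficulty-aware reward with detail (partial-credit) terms and training that progresses from simple to complex tasks, but gives no formulas. The sketch below is only an illustration of how such ideas could be combined, not the authors' method. Every name, field, weight, and the easy-to-hard ordering rule (`Sample`, `detail_reward`, `shaped_reward`, `detail_weight`, `difficulty_bonus`, `curriculum`) is a hypothetical assumption introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # All fields are hypothetical; the abstract does not define a data schema.
    predicted_route: List[str]  # stops produced by the model
    reference_route: List[str]  # ground-truth stops
    difficulty: float           # assumed difficulty score in [0, 1]


def detail_reward(pred: List[str], ref: List[str]) -> float:
    """Partial credit: fraction of reference stops present in the prediction."""
    if not ref:
        return 0.0
    pred_set = set(pred)
    hits = sum(1 for stop in ref if stop in pred_set)
    return hits / len(ref)


def shaped_reward(sample: Sample,
                  detail_weight: float = 0.5,
                  difficulty_bonus: float = 1.0) -> float:
    """Assumed shaping: (exact-match reward + weighted detail reward),
    scaled up for harder samples. A sparse reward would use only `correct`."""
    correct = 1.0 if sample.predicted_route == sample.reference_route else 0.0
    base = correct + detail_weight * detail_reward(
        sample.predicted_route, sample.reference_route)
    return (1.0 + difficulty_bonus * sample.difficulty) * base


def curriculum(samples: List[Sample]) -> List[Sample]:
    """Assumed staging rule: order samples from easy to hard."""
    return sorted(samples, key=lambda s: s.difficulty)


if __name__ == "__main__":
    partial = Sample(["A", "B"], ["A", "B", "C"], difficulty=0.5)
    exact = Sample(["A", "B"], ["A", "B"], difficulty=0.0)
    for s in curriculum([partial, exact]):
        print(s.difficulty, round(shaped_reward(s), 3))
```

In this sketch a partially correct route still receives a nonzero signal, which is the general motivation for adding dense terms to an otherwise all-or-nothing reward; the actual reward terms and stage definitions should be taken from the paper itself.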
