
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

October 2, 2025
Authors: Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang
cs.AI

Abstract

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
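The abstract describes the difficulty-aware reward design only at a high level: a dense "detail reward" provides partial credit alongside the sparse exact-match signal, scaled by sample difficulty. The sketch below is a minimal, hypothetical Python illustration of how such a combination could look for a transit-route answer. The class and function names (RouteAnswer, detail_reward, difficulty_aware_reward), the choice of detail components, and the specific weighting are assumptions for illustration only, not the paper's actual formulation.

```python
# Hypothetical sketch of a difficulty-aware reward with dense detail rewards,
# in the spirit of the design summarized in the abstract. All names, reward
# components, and weights here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class RouteAnswer:
    """A parsed transit-route answer: ordered stops and the transit lines taken."""
    stops: list[str]
    lines: list[str]


def detail_reward(pred: RouteAnswer, gold: RouteAnswer) -> float:
    """Dense partial credit for fine-grained details (assumed components):
    fraction of correct stops and fraction of correct transit lines."""
    if not gold.stops or not gold.lines:
        return 0.0
    stop_acc = len(set(pred.stops) & set(gold.stops)) / len(gold.stops)
    line_acc = len(set(pred.lines) & set(gold.lines)) / len(gold.lines)
    return 0.5 * (stop_acc + line_acc)


def difficulty_aware_reward(pred: RouteAnswer, gold: RouteAnswer,
                            difficulty: float) -> float:
    """Combine the sparse exact-match reward with the dense detail reward,
    scaled by a per-sample difficulty score in [0, 1] (harder samples weigh more)."""
    exact = float(pred.stops == gold.stops and pred.lines == gold.lines)
    dense = detail_reward(pred, gold)
    # Assumed mixing rule: exact-match reward plus a detail bonus, both scaled
    # by (1 + difficulty) so harder questions contribute larger learning signal.
    return (1.0 + difficulty) * (exact + 0.5 * dense)


if __name__ == "__main__":
    gold = RouteAnswer(stops=["A", "B", "C", "D"], lines=["Line 1", "Line 3"])
    pred = RouteAnswer(stops=["A", "B", "D"], lines=["Line 1"])
    # A partially correct answer still receives a nonzero, difficulty-scaled reward.
    print(difficulty_aware_reward(pred, gold, difficulty=0.8))
```

The point of such a scheme, as the abstract argues, is that partially correct answers still produce informative gradients during RL instead of a flat zero reward, which is what makes the early, perception-focused stages of the multi-stage curriculum trainable.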