SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards
May 25, 2025
Authors: Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng
cs.AI
Abstract
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text
domain through stable reinforcement learning (RL). Recently, in the multimodal
domain, works have begun to directly apply RL to generate R1-like free-form
reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks
differ fundamentally from textual ones: they rely heavily on understanding
the input image to solve the problem. Therefore, such
free-form reasoning faces two critical limitations in the VQA task: (1)
Extended reasoning chains diffuse visual focus away from task-critical regions,
degrading answer accuracy. (2) Unverifiable intermediate steps amplify
policy-gradient variance and computational overhead. To address these
issues, in this paper, we introduce SATORI (Spatially
Anchored Task Optimization with
ReInforcement Learning), which decomposes VQA into three
verifiable stages, including global image captioning, region localization, and
answer prediction, each supplying explicit reward signals. Furthermore, we
introduce VQA-Verify, a dataset of 12k examples annotated with answer-aligned
captions and bounding boxes to facilitate training. Experiments demonstrate consistent
performance improvements across seven VQA benchmarks, achieving up to 15.7%
improvement in accuracy compared to the R1-like baseline. Our analysis of the
attention maps confirms enhanced focus on task-critical regions, which underpins
the accuracy gains. Our code is available at
https://github.com/justairr/SATORI-R1.
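The three-stage decomposition with explicit reward signals could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`iou`, `staged_reward`), the token-overlap caption score, and the stage weights `w_cap`, `w_loc`, `w_ans` are all hypothetical stand-ins for whatever per-stage verifiers and weighting SATORI actually uses.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def staged_reward(pred_caption, gold_caption, pred_box, gold_box,
                  pred_answer, gold_answer,
                  w_cap=0.2, w_loc=0.3, w_ans=0.5):
    """Composite verifiable reward over the three stages:
    caption quality, region localization, and answer correctness.
    Each stage is scored against ground truth, so every intermediate
    step yields a checkable signal rather than free-form text."""
    # Stage 1: caption reward via token-set F1 against the
    # answer-aligned reference caption (a simple stand-in metric).
    p = set(pred_caption.lower().split())
    g = set(gold_caption.lower().split())
    common = len(p & g)
    cap_r = 2.0 * common / (len(p) + len(g)) if common else 0.0
    # Stage 2: localization reward via IoU with the annotated box.
    loc_r = iou(pred_box, gold_box)
    # Stage 3: binary exact-match reward on the final answer.
    ans_r = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_cap * cap_r + w_loc * loc_r + w_ans * ans_r
```

Because each stage is verified against a ground-truth annotation from a dataset like VQA-Verify, the reward is dense and low-variance compared to scoring only the final free-form answer.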