SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards
May 25, 2025
Authors: Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng
cs.AI
Abstract
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text
domain through stable reinforcement learning (RL). Recently, in the multimodal
domain, works have begun to directly apply RL to generate R1-like free-form
reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks
differ fundamentally from textual ones: they rely heavily on understanding
the input image to solve the problem. Therefore, such
free-form reasoning faces two critical limitations in the VQA task: (1)
Extended reasoning chains diffuse visual focus away from task-critical regions,
degrading answer accuracy. (2) Unverifiable intermediate steps amplify
policy-gradient variance and computational overhead. To address these
issues, in this paper, we introduce SATORI (Spatially
Anchored Task Optimization with
ReInforcement Learning), which decomposes VQA into three
verifiable stages, including global image captioning, region localization, and
answer prediction, each supplying explicit reward signals. Furthermore, we
introduce VQA-Verify, a dataset of 12k examples annotated with answer-aligned
captions and bounding boxes to facilitate training. Experiments demonstrate consistent
performance improvements across seven VQA benchmarks, achieving up to 15.7%
improvement in accuracy compared to the R1-like baseline. Our analysis of the
attention maps confirms enhanced focus on task-critical regions, which underpins
the accuracy gains. Our code is available at
https://github.com/justairr/SATORI-R1.
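The three-stage decomposition with explicit reward signals could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`iou`, `staged_reward`), the token-overlap caption score, and the stage weights `w_cap`, `w_loc`, `w_ans` are all hypothetical stand-ins for whatever per-stage verifiers and weighting SATORI actually uses.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def staged_reward(pred_caption, gold_caption, pred_box, gold_box,
                  pred_answer, gold_answer,
                  w_cap=0.2, w_loc=0.3, w_ans=0.5):
    """Composite verifiable reward over the three stages:
    caption quality, region localization, and answer correctness.
    Each stage is scored against ground truth, so every intermediate
    step yields a checkable signal rather than free-form text."""
    # Stage 1: caption reward via token-set F1 against the
    # answer-aligned reference caption (a simple stand-in metric).
    p = set(pred_caption.lower().split())
    g = set(gold_caption.lower().split())
    common = len(p & g)
    cap_r = 2.0 * common / (len(p) + len(g)) if common else 0.0
    # Stage 2: localization reward via IoU with the annotated box.
    loc_r = iou(pred_box, gold_box)
    # Stage 3: binary exact-match reward on the final answer.
    ans_r = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_cap * cap_r + w_loc * loc_r + w_ans * ans_r
```

Because each stage is verified against a ground-truth annotation from a dataset like VQA-Verify, the reward is dense and low-variance compared to scoring only the final free-form answer.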