

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

April 9, 2026
Authors: Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu
cs.AI

Abstract

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of the constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by 13%. It also improves final answer accuracy over standard GRPO, demonstrating that faithful reasoning enables better answers.
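The constrained advantage computation described above can be sketched as follows. This is a minimal illustration based only on the abstract: the reward definitions, constraint targets, multiplier names, and dual-ascent learning rate are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of FGRPO-style constrained advantages: task rewards are
# augmented with Lagrange-multiplier-weighted consistency/grounding scores,
# normalized within a group (as in GRPO), and the multipliers are updated by
# dual ascent on batch-level constraint violations. All specifics (targets,
# learning rate, score scales) are illustrative assumptions.
import numpy as np

def fgrpo_advantages(task_rewards, consistency, grounding, lam,
                     target_consistency=0.95, target_grounding=0.85,
                     dual_lr=0.05):
    """Group-relative advantages with Lagrangian constraint terms.

    task_rewards : (G,) verifiable task rewards for one group of rollouts
    consistency  : (G,) per-rollout CoT->answer consistency scores in [0, 1]
    grounding    : (G,) per-rollout visual grounding scores in [0, 1]
    lam          : dict of multipliers {'c': float, 'g': float}
    Returns (advantages, updated multipliers).
    """
    task_rewards = np.asarray(task_rewards, dtype=float)
    consistency = np.asarray(consistency, dtype=float)
    grounding = np.asarray(grounding, dtype=float)

    # Lagrangian reward: task reward plus multiplier-weighted constraint scores.
    combined = task_rewards + lam['c'] * consistency + lam['g'] * grounding

    # Standard GRPO step: normalize the combined reward within the group.
    adv = (combined - combined.mean()) / (combined.std() + 1e-8)

    # Dual ascent: increase a multiplier when the batch-level constraint is
    # violated (mean score below its target); clip at zero otherwise.
    new_lam = {
        'c': max(0.0, lam['c'] + dual_lr * (target_consistency - consistency.mean())),
        'g': max(0.0, lam['g'] + dual_lr * (target_grounding - grounding.mean())),
    }
    return adv, new_lam
```

Because the multipliers grow while a constraint is violated, rollouts with consistent, well-grounded CoTs receive progressively larger relative advantages until the batch-level targets are met, after which the pressure automatically relaxes.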