Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
April 9, 2026
Authors: Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu
cs.AI
Abstract
Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of the constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by 13%. It also improves final answer accuracy over standard GRPO, demonstrating that faithful reasoning enables better answers.
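To make the constrained-optimization idea concrete, here is a minimal sketch of how batch-level consistency and grounding constraints could be folded into GRPO's group-relative advantages via Lagrangian dual ascent. The exact reward shaping, thresholds (`tau_c`, `tau_g`), and dual step size used in the paper are not specified in this abstract, so everything below is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def fgrpo_advantages(rewards, consistency, grounding,
                     lam_c, lam_g, tau_c=1.0, tau_g=1.0, lr_dual=0.01):
    """Hypothetical constrained group-relative advantage computation.

    rewards, consistency, grounding: per-rollout scores within one group.
    lam_c, lam_g: current Lagrange multipliers for the two constraints.
    tau_c, tau_g: target (threshold) values the constraint scores should meet.
    Returns group-normalized advantages and updated multipliers.
    """
    # Shape each rollout's reward with Lagrangian terms for the two constraints.
    shaped = (rewards
              + lam_c * (consistency - tau_c)
              + lam_g * (grounding - tau_g))
    # Standard GRPO step: normalize the shaped rewards within the group.
    adv = (shaped - shaped.mean()) / (shaped.std() + 1e-8)
    # Dual ascent on the batch-level constraints: a multiplier grows while its
    # constraint is violated on average, increasing that term's importance.
    lam_c = max(0.0, lam_c + lr_dual * (tau_c - consistency.mean()))
    lam_g = max(0.0, lam_g + lr_dual * (tau_g - grounding.mean()))
    return adv, lam_c, lam_g
```

Because the multipliers update from batch averages, the trade-off between answer reward and reasoning-quality constraints adapts automatically during training, rather than relying on a hand-tuned fixed penalty weight.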