Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Abstract

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that these accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs, including ViGoRL-Spatial, TreeVGR, and our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group and adaptively adjusts the relative importance of the constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by 13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
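To make the idea of enforcing constraints through Lagrangian dual ascent inside a group-relative advantage more concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names, the additive way the constraint terms enter the advantage, the thresholds, and the dual learning rate are all hypothetical.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Group-relative normalization in the style of GRPO: (x - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(acc, cons, ground, lam_cons, lam_ground):
    """Hypothetical combination (an assumption, not taken from the paper):
    accuracy advantage plus multiplier-weighted consistency and grounding advantages."""
    a_acc = group_normalize(acc)
    a_cons = group_normalize(cons)
    a_ground = group_normalize(ground)
    return [
        a + lam_cons * c + lam_ground * g
        for a, c, g in zip(a_acc, a_cons, a_ground)
    ]


def dual_ascent_step(lam, batch_score, threshold, lr=0.1):
    """One projected dual-ascent update for the constraint batch_score >= threshold.
    The multiplier grows while the constraint is violated and shrinks otherwise,
    and is clipped so that it never becomes negative."""
    return max(0.0, lam + lr * (threshold - batch_score))


# Toy group of four rollouts for one prompt (illustrative numbers only).
acc = [1.0, 0.0, 1.0, 0.0]      # verifiable answer reward
cons = [1.0, 1.0, 0.0, 0.0]     # 1.0 if the CoT entails the final answer
ground = [0.8, 0.6, 0.4, 0.2]   # visual grounding score in [0, 1]

# Assumed thresholds of 0.9 (consistency) and 0.7 (grounding).
lam_cons = dual_ascent_step(0.0, mean(cons), threshold=0.9)
lam_ground = dual_ascent_step(0.0, mean(ground), threshold=0.7)
advantages = combined_advantages(acc, cons, ground, lam_cons, lam_ground)
```

In this toy group, the first and third rollouts are both correct, but only the first is consistent and well grounded, so it receives the larger advantage once the multipliers are positive. This is the qualitative behavior the abstract describes: the weight on each constraint adapts to how badly the batch currently violates it.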
