Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Abstract

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of the constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by 13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
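
To make the idea of constraints entering the group advantage concrete, the following is a minimal, generic sketch of Lagrangian dual ascent combined with group-relative advantages. It is not the formulation from the paper, which the abstract does not spell out. The function names, the additive way the terms are combined, the thresholds, and the dual learning rate are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def constrained_advantages(acc, cons, grnd, lam_c, lam_g):
    """Add multiplier-weighted constraint signals to the accuracy advantage.

    acc, cons, grnd are per-rollout scores for one prompt group
    (answer correctness, logical consistency, visual grounding).
    The additive combination is an assumption, not the paper's formula.
    """
    a_acc, a_c, a_g = normalize(acc), normalize(cons), normalize(grnd)
    return [a + lam_c * c + lam_g * g for a, c, g in zip(a_acc, a_c, a_g)]


def dual_ascent_step(lam, batch_mean, threshold, lr=0.05):
    """Projected dual ascent on one multiplier.

    The multiplier grows while the batch average is below the
    (hypothetical) threshold and shrinks toward zero once it is met.
    """
    return max(0.0, lam + lr * (threshold - batch_mean))


# One group of four rollouts for the same prompt (made-up scores).
acc = [1.0, 1.0, 0.0, 0.0]    # final answer correct?
cons = [1.0, 0.0, 1.0, 0.0]   # CoT entails the final answer?
grnd = [0.9, 0.4, 0.8, 0.3]   # visual grounding score
adv = constrained_advantages(acc, cons, grnd, lam_c=0.5, lam_g=0.5)

# Consistency is below the assumed 0.9 target, so its multiplier increases.
lam_c = dual_ascent_step(0.5, batch_mean=mean(cons), threshold=0.9)
```

In this toy group, the first rollout (correct, consistent, and well grounded) receives the largest advantage, while a correct but inconsistent rollout is ranked below it. The multiplier update is what would make the weighting adaptive: a constraint that the batch keeps violating gains weight over time, and a satisfied one loses it.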