
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

June 19, 2025
Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
cs.AI

Abstract

Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
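
The two-tiered reward described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction based only on the abstract, not the authors' implementation: the `Sample` structure, the `care_rewards` function, the bonus scale `beta`, and the group-mean comparison rule are illustrative assumptions standing in for the reference-model likelihood comparison GRPO-CARE performs within each sampled group.

```python
# Minimal sketch (not the authors' code) of the two-tiered GRPO-CARE reward.
# Assumption: `ref_logprob` stands in for the slowly-evolving reference model's
# log P(answer | reasoning) for each sampled rollout; the bonus scale `beta`
# and the group-mean comparison rule are illustrative choices.

from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    is_correct: bool     # whether the final answer matches the ground truth
    ref_logprob: float   # reference model's log P(answer | reasoning) for this sample


def care_rewards(group: List[Sample], base: float = 1.0, beta: float = 0.5) -> List[float]:
    """Two-tiered reward: a base reward for answer correctness plus an adaptive
    consistency bonus for correct samples whose reasoning-to-answer likelihood
    (under the reference model) exceeds the group average."""
    if not group:
        return []
    # Group-level baseline for the consistency comparison.
    mean_logprob = sum(s.ref_logprob for s in group) / len(group)
    rewards = []
    for s in group:
        r = base if s.is_correct else 0.0            # tier 1: answer correctness
        if s.is_correct and s.ref_logprob > mean_logprob:
            r += beta                                 # tier 2: consistency bonus
        rewards.append(r)
    return rewards


if __name__ == "__main__":
    group = [
        Sample(is_correct=True,  ref_logprob=-1.2),   # correct and highly consistent
        Sample(is_correct=True,  ref_logprob=-3.5),   # correct but less consistent
        Sample(is_correct=False, ref_logprob=-0.9),   # consistent reasoning, wrong answer
    ]
    print(care_rewards(group))   # [1.5, 1.0, 0.0]
```

In this sketch the adaptive bonus plays the exploration-regularizing role that the KL penalty plays in standard GRPO: only rollouts that are both correct and more reasoning-consistent than their group peers receive the extra reward.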