GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
June 19, 2025
Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
cs.AI
Abstract
Recent reinforcement learning approaches, such as outcome-supervised GRPO,
have advanced Chain-of-Thought reasoning in large language models (LLMs), yet
their adaptation to multimodal LLMs (MLLMs) remains underexplored. To address the lack
of rigorous evaluation for MLLM post-training methods, we introduce
SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced
perception and reasoning. It offers a large training set and evaluates
generalization across three escalating challenges: in-distribution,
cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1,
we find that standard GRPO, while improving answer accuracy, often reduces
logical coherence between reasoning steps and answers, with only a 57.9%
consistency rate. This stems from reward signals focusing solely on final
answers, encouraging shortcuts, and strict KL penalties limiting exploration. To
address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing
both answer correctness and reasoning coherence without explicit supervision.
GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer
correctness, and (2) an adaptive consistency bonus, computed by comparing the
model's reasoning-to-answer likelihood (via a slowly-evolving reference model)
against group peers. This dual mechanism amplifies rewards for reasoning paths
that are both correct and logically consistent. Replacing KL penalties with
this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1,
achieving a 6.7% performance gain on the hardest evaluation level and a 24.5%
improvement in consistency. It also shows strong transferability, improving
model performance across diverse video understanding benchmarks. Our work
contributes a systematically designed benchmark and a generalizable
post-training framework, advancing the development of more interpretable and
robust MLLMs.
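
The following is a minimal, hypothetical sketch of how the two-tiered reward described in the abstract might be computed for one GRPO rollout group. It is based only on the abstract, not the authors' released code: the names (`Sample`, `care_rewards`, `base`, `bonus_scale`, `ref_likelihood`) and the specific rule of granting the consistency bonus when a correct sample's reference-scored reasoning-to-answer likelihood exceeds the group-peer average are illustrative assumptions.

```python
# Hypothetical sketch of GRPO-CARE's two-tiered reward (abstract-level only).
# All names, scales, and the above-peer-average rule are assumptions, not the
# paper's actual implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    reasoning: str          # model-generated chain of thought
    answer: str             # final answer extracted from the output
    correct: bool           # outcome supervision: answer matches ground truth?
    ref_likelihood: float   # likelihood the slowly-evolving reference model
                            # assigns to the answer given the reasoning


def care_rewards(group: List[Sample],
                 base: float = 1.0,
                 bonus_scale: float = 0.5) -> List[float]:
    """Combine a base correctness reward with an adaptive consistency bonus.

    The bonus is granted when a correct sample's reasoning-to-answer
    likelihood (scored by the reference model) exceeds the group-peer
    average, amplifying rewards for paths that are both correct and
    logically consistent; it stands in for the KL penalty.
    """
    peer_avg = sum(s.ref_likelihood for s in group) / len(group)
    rewards = []
    for s in group:
        r = base if s.correct else 0.0
        if s.correct and s.ref_likelihood > peer_avg:
            r += bonus_scale  # adaptive consistency bonus
        rewards.append(r)
    return rewards


# Toy usage: two correct rollouts, one with more coherent reasoning.
group = [
    Sample("step-by-step reasoning ...", "B", correct=True, ref_likelihood=0.82),
    Sample("shortcut guess ...", "B", correct=True, ref_likelihood=0.41),
]
print(care_rewards(group))  # e.g. [1.5, 1.0]
```

Under these assumptions, the resulting rewards would feed into standard group-relative advantage normalization as in GRPO, with the bonus rather than a KL term discouraging reasoning-answer mismatch.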