GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Abstract

Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) remains unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark built on complex real-world videos that require balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces the logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from two factors: reward signals that focus solely on final answers, which encourages shortcuts, and strict KL penalties that limit exploration. As a remedy, we propose GRPO-CARE, a consistency-aware RL framework that optimizes both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
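
The two-tiered reward described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the function names (`two_tier_rewards`, `group_advantages`, `ema_update`), the bonus size of 0.5, the choice of the mean likelihood among correct samples as the peer threshold, and the use of an exponential moving average to model the "slowly-evolving" reference model. Only the group-normalized advantage follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def two_tier_rewards(correct, ref_likelihoods, bonus=0.5):
    """Combine a base accuracy reward with a consistency bonus.

    correct:         one bool per sampled response in the group.
    ref_likelihoods: likelihood the reference model assigns to each
                     response's answer given its reasoning trace.
    bonus:           hypothetical bonus size (not taken from the paper).
    """
    if len(correct) != len(ref_likelihoods):
        raise ValueError("inputs must have one entry per sampled response")

    # Tier 1: base reward for answer correctness.
    base = [1.0 if c else 0.0 for c in correct]

    # Tier 2: compare each correct sample against its correct group peers.
    peer_liks = [p for c, p in zip(correct, ref_likelihoods) if c]
    if not peer_liks:
        return base  # no correct sample, so no bonus is awarded
    threshold = mean(peer_liks)  # assumed peer baseline

    return [
        b + bonus if c and p >= threshold else b
        for b, c, p in zip(base, correct, ref_likelihoods)
    ]


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO group-normalized advantage: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ema_update(ref_params, policy_params, decay=0.99):
    """Assumed slow reference update via an exponential moving average."""
    return [decay * r + (1.0 - decay) * p
            for r, p in zip(ref_params, policy_params)]


if __name__ == "__main__":
    correct = [True, True, False, True]
    liks = [0.9, 0.2, 0.95, 0.6]
    rewards = two_tier_rewards(correct, liks)
    print(rewards)
    print(group_advantages(rewards))
```

In this toy group, the incorrect response receives no bonus even though its reasoning-to-answer likelihood is the highest, and the correct response with a low likelihood keeps only its base reward. That is the intended effect of amplifying rewards only for paths that are both correct and consistent.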
