Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Abstract

Recent advances in Chain of Thought (CoT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored on tasks that require both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It comprises intricate real-world videos and complex everyday planning tasks posed as multiple-choice questions that require sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy (in-distribution, cross-environment, and cross-environment-task scenarios) and provides a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as the base model, we compare RL with supervised fine-tuning (SFT) and find that RL is more data-efficient and achieves superior performance on both in-distribution and out-of-distribution tasks; it also outperforms SFT on general video understanding benchmarks such as LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations, such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
