Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Abstract

Recent advances in Chain-of-Thought (CoT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks that require both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It comprises intricate real-world videos and complex everyday planning tasks posed as multiple-choice questions, which require sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy of in-distribution, cross-environment, and cross-environment-task scenarios, and provides a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as the base model, we compare RL with supervised fine-tuning (SFT) and find that RL is more data-efficient and performs better on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks such as LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces reasoning chains that are less logically coherent. We identify key limitations, such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
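To make the phrase "easily verifiable ground-truth answers" concrete, the following is a minimal sketch of how a multiple-choice answer could be scored as a binary outcome reward and how accuracy could be reported per generalization level. It is an illustration only, not the authors' implementation: the `<answer>` tag convention, the function names, and the record layout are all assumptions that the abstract does not specify.

```python
import re
from collections import defaultdict

# Assumed output convention (hypothetical): the final choice is wrapped in
# <answer>...</answer> tags, optionally in parentheses and followed by text.
ANSWER_RE = re.compile(
    r"<answer>\s*\(?([A-Z])\b\)?[^<]*</answer>", re.IGNORECASE | re.DOTALL
)


def extract_choice(response):
    """Return the selected option letter in upper case, or None if absent."""
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None


def outcome_reward(response, ground_truth):
    """Binary reward: 1.0 if the extracted letter equals the ground truth."""
    return 1.0 if extract_choice(response) == ground_truth.strip().upper() else 0.0


def accuracy_by_level(records):
    """Compute accuracy per level from (level, response, ground_truth) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(float)
    for level, response, truth in records:
        totals[level] += 1
        hits[level] += outcome_reward(response, truth)
    return {level: hits[level] / totals[level] for level in totals}
```

A reward of this kind checks only the final choice, so it gives no direct signal about whether the preceding reasoning chain is coherent, which is one way to read the limitation noted in the abstract.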
