Video-R1: Reinforcing Video Reasoning in MLLMs

Abstract

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We construct two datasets, Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks such as MVBench and TempCompass. Notably, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.
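To make the GRPO starting point concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO-style training is generally built on, paired with a simple rule-based accuracy reward. It is not the Video-R1 implementation and does not show the temporal component of T-GRPO, which the abstract does not specify. The function names, the exact-match reward rule, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def rule_based_reward(predicted: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the answers match exactly
    after trimming whitespace and lowercasing, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Each response's advantage is its reward minus the group mean, divided
    by the group standard deviation (population std is an assumption here).
    eps avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of four sampled answers to one multiple-choice question
# whose reference answer is "B" (illustrative data only).
answers = ["B", "b", "C", "A"]
rewards = [rule_based_reward(a, "B") for a in answers]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct answers receive positive advantages and incorrect ones negative advantages, and a group in which every answer earns the same reward yields zero advantage for all of its members, so it contributes no learning signal.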