Scaling RL to Long Videos

Abstract

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos using reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which combines sequence parallelism with a vLLM-based engine tailored for long video and uses cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. On our LongVideo-Reason-eval benchmark, it also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales, marking a firm step towards long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
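As a rough sanity check on the scale quoted at the end of the abstract, the sketch below works out what 3,600 frames and roughly 256k tokens would mean per frame and per GPU on an 8-GPU node. It is a back-of-the-envelope illustration, not the MR-SP implementation. It assumes that "256k" means 256 × 1024 tokens, that all of those tokens are visual tokens, and that sequence parallelism splits frames evenly across GPUs. The abstract does not state any of these assumptions, and the helper name `per_gpu_token_budget` is hypothetical.

```python
def per_gpu_token_budget(total_tokens: int, num_frames: int, num_gpus: int):
    """Return (tokens per frame, frames per GPU, tokens per GPU) for an even split.

    Hypothetical helper for illustration only; an even split across GPUs
    is an assumption, not something the abstract specifies.
    """
    tokens_per_frame = total_tokens / num_frames
    frames_per_gpu = num_frames / num_gpus
    tokens_per_gpu = total_tokens / num_gpus
    return tokens_per_frame, frames_per_gpu, tokens_per_gpu


# Figures from the abstract; "256k" is read here as 256 * 1024 (an assumption).
tpf, fpg, tpg = per_gpu_token_budget(256 * 1024, 3600, 8)
print(f"{tpf:.1f} tokens/frame, {fpg:.0f} frames/GPU, {tpg:.0f} tokens/GPU")
```

Under these assumptions each GPU would handle about 450 frames, or roughly 32k tokens, instead of the full sequence. This suggests why sharding the sequence makes hour-long inputs tractable on a single node.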
