Scaling RL to Long Videos
July 10, 2025
Authors: Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
cs.AI
Abstract
We introduce a full-stack framework that scales up reasoning in
vision-language models (VLMs) to long videos, leveraging reinforcement
learning. We address the unique challenges of long video reasoning by
integrating three critical components: (1) a large-scale dataset,
LongVideo-Reason, comprising 52K long video QA pairs with high-quality
reasoning annotations across diverse domains such as sports, games, and vlogs;
(2) a two-stage training pipeline that extends VLMs with chain-of-thought
supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a
training infrastructure for long video RL, named Multi-modal Reinforcement
Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a
vLLM-based engine tailored for long videos, using cached video embeddings for
efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves
strong performance on long video QA benchmarks such as VideoMME. It also
outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal
reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on
our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to
2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent
performance gains as the number of input video frames increases. LongVILA-R1 marks
a firm step towards long video reasoning in VLMs. In addition, we publicly
release our training system, which supports RL training on various modalities
(video, text, and audio), various models (the VILA and Qwen series), and
even image and video generation models. On a single A100 node (8 GPUs), it
supports RL training on hour-long videos (e.g., 3,600 frames / around 256k
tokens).
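
As a rough illustration of the cached-video-embedding idea behind MR-SP, the sketch below encodes a long video's frames once, reuses the cached embeddings across repeated RL rollouts, and shards the resulting token sequence across GPUs in the style of sequence parallelism. This is a minimal toy under assumed names, not the released system: DummyVisionEncoder, EmbeddingCache, and shard_sequence are hypothetical placeholders, and a real setup would run inside the vLLM-based rollout engine described above.

import torch

class DummyVisionEncoder(torch.nn.Module):
    # Stand-in for the VLM's vision tower: maps raw frames to token embeddings.
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 16 * 16, dim)

    @torch.no_grad()
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 16, 16) -> (num_frames, dim)
        return self.proj(frames.flatten(1))

class EmbeddingCache:
    # Encode each video once; later rollouts reuse the cached embeddings
    # instead of re-running the vision encoder on every policy update.
    def __init__(self, encoder: torch.nn.Module):
        self.encoder = encoder
        self._cache: dict[str, torch.Tensor] = {}

    def get(self, video_id: str, frames: torch.Tensor) -> torch.Tensor:
        if video_id not in self._cache:
            self._cache[video_id] = self.encoder(frames)
        return self._cache[video_id]

def shard_sequence(embeds: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # Sequence parallelism in miniature: each GPU keeps one contiguous
    # slice of the (very long) visual token sequence.
    return embeds.chunk(world_size, dim=0)[rank]

if __name__ == "__main__":
    cache = EmbeddingCache(DummyVisionEncoder())
    frames = torch.randn(3600, 3, 16, 16)       # hour-long video at ~1 fps
    for step in range(4):                        # repeated RL rollouts
        embeds = cache.get("video_0", frames)    # encoder runs only at step 0
        local = shard_sequence(embeds, rank=0, world_size=8)
        print(step, tuple(embeds.shape), tuple(local.shape))

Caching pays off because RL revisits the same video many times during rollout and prefill; at the scale quoted above (3,600 frames, around 256k tokens), re-encoding frames on every rollout would otherwise dominate training time.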