

Scaling RL to Long Videos

July 10, 2025
作者: Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
cs.AI

Abstract

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) followed by reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. On our LongVideo-Reason-eval benchmark, it outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales, marking a firm step toward long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
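The two efficiency ideas in the abstract can be sketched in a few lines of Python: caching the video embeddings once so that repeated RL rollouts skip re-encoding, and the rough frames-to-tokens arithmetic behind the single-node claim (3,600 frames at roughly 71 tokens per frame is about 256k tokens). This is a minimal illustration only; every name below is hypothetical, and the real MR-SP system additionally uses sequence parallelism and a vLLM-based engine.

```python
# Hedged sketch of the cached-video-embedding idea for long-video RL rollouts.
# All function and class names here are illustrative, not the MR-SP API.

TOKENS_PER_FRAME = 71  # rough estimate: 3,600 frames ~= 256k tokens

def encode_video(frames):
    """Stand-in for the expensive vision encoder (run once per video)."""
    return [("emb", i) for i in range(len(frames))]  # one embedding per frame

class CachedRolloutSampler:
    """Encode the video a single time; reuse embeddings for every rollout."""

    def __init__(self, frames):
        # In the paper's design, this cost is paid once and the embeddings
        # are cached, rather than re-encoding the video for each RL sample.
        self.video_embeddings = encode_video(frames)

    def rollout(self, prompt):
        # Prefill reuses the cached embeddings, so per-rollout cost is
        # dominated by text generation rather than vision encoding.
        video_tokens = len(self.video_embeddings) * TOKENS_PER_FRAME
        return {"prompt": prompt, "video_tokens": video_tokens}

frames = list(range(3600))  # an hour-long video sampled at 1 frame/second
sampler = CachedRolloutSampler(frames)
out = sampler.rollout("What happens after the goal is scored?")
print(out["video_tokens"])  # 255600, i.e. around 256k tokens of video context
```

Multiple rollouts per question (as RL training requires) then amortize the one-time encoding cost, which is consistent with the reported speedup from caching, though the 2.1x figure itself depends on the full MR-SP pipeline.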