Scaling RL to Long Videos

Abstract

We introduce a full-stack framework that uses reinforcement learning to scale reasoning in vision-language models (VLMs) to long videos. We address the unique challenges of long video reasoning by integrating three components: (1) LongVideo-Reason, a large-scale dataset of 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) followed by reinforcement learning (RL); and (3) Multi-modal Reinforcement Sequence Parallelism (MR-SP), a training infrastructure for long video RL that combines sequence parallelism with a vLLM-based engine tailored to long video and uses cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. On our LongVideo-Reason-eval benchmark, it outperforms Video-R1-7B and matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning. The MR-SP system achieves up to a 2.1x speedup on long video RL training, and LongVILA-R1 shows consistent performance gains as the number of input video frames increases. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. We also publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (the VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames, or around 256k tokens).
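To put the hour-long figure in perspective, the following is a rough back-of-the-envelope sketch of the token budget under sequence parallelism. It takes only the three numbers quoted in the abstract (3,600 frames, around 256k tokens, 8 GPUs). The abstract does not say how MR-SP partitions the sequence, so the even, contiguous split below and the helper name `shard_frames` are assumptions made for illustration, and "256k" is treated as exactly 256,000.

```python
def shard_frames(num_frames: int, num_gpus: int) -> list[range]:
    """Split frame indices into contiguous, near-equal chunks, one per GPU.

    Hypothetical helper: an even contiguous split is assumed here and is
    not taken from the MR-SP description.
    """
    base, extra = divmod(num_frames, num_gpus)
    shards, start = [], 0
    for rank in range(num_gpus):
        # The first `extra` ranks take one additional frame each.
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


NUM_FRAMES = 3_600       # hour-long video, as quoted in the abstract
TOTAL_TOKENS = 256_000   # "around 256k" in the abstract; assumed exact here
NUM_GPUS = 8             # single A100 node

# Average visual tokens per frame implied by the quoted numbers (about 71).
tokens_per_frame = TOTAL_TOKENS / NUM_FRAMES

shards = shard_frames(NUM_FRAMES, NUM_GPUS)
tokens_per_gpu = [round(len(s) * tokens_per_frame) for s in shards]

for rank, (shard, tokens) in enumerate(zip(shards, tokens_per_gpu)):
    print(f"rank {rank}: frames {shard.start}-{shard.stop - 1}, ~{tokens} tokens")
```

Under these assumptions each GPU would hold 450 frames, or roughly one eighth of the full sequence, rather than all 256k tokens at once.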
