VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Abstract

Recent advances in reinforcement learning have significantly improved the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms show promise in the text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold on tasks such as temporal grounding (+31.8) and object tracking (+31.2). It also improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of video MLLMs. We hope our work offers valuable insights for future RL research on video MLLMs.
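To make the two ingredients named above more concrete, the following is a minimal sketch of a rule-based reward for temporal grounding together with the group-relative advantage normalization that GRPO is generally described as using. It is an illustration only, not the authors' implementation: the choice of interval IoU as the reward, the function names temporal_iou and group_relative_advantages, and the eps constant are all assumptions introduced here.

```python
from statistics import mean, pstdev


def temporal_iou(pred, target):
    """Intersection-over-union of two (start, end) intervals in seconds.

    Assumed here as a rule-based reward for temporal grounding;
    returns 0.0 for malformed (empty or reversed) intervals.
    """
    (ps, pe), (ts, te) = pred, target
    if pe <= ps or te <= ts:
        return 0.0
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps is a hypothetical
    stabilizer to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical example: one ground-truth segment, four sampled predictions.
    target = (5.0, 15.0)
    predictions = [(0.0, 10.0), (5.0, 15.0), (20.0, 25.0), (6.0, 12.0)]
    rewards = [temporal_iou(p, target) for p in predictions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, responses whose predicted segment overlaps the target more than the group average receive a positive advantage, and the rest receive a negative one, so no separately learned value model is needed to rank them.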
