
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

April 9, 2025
Authors: Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
cs.AI

Abstract

Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
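The abstract builds on Group Relative Policy Optimization (GRPO) with rule-based rewards. The sketch below illustrates, under stated assumptions, how these two ingredients might look for a temporal grounding task: a rule-based IoU reward over predicted time spans, and GRPO's group-relative advantage normalization that needs no learned value network. The function names, reward choice, and sample values are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a rule-based reward and GRPO-style group-relative
# advantages for temporal grounding. Names and values are assumptions.
from typing import List, Tuple
import math


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Rule-based reward: IoU between a predicted and a ground-truth time span (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO scores each sampled response against its own group:
    advantage_i = (r_i - mean(r)) / std(r)."""
    mean_r = sum(rewards) / len(rewards)
    var_r = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    std_r = math.sqrt(var_r) + 1e-8  # guard against division by zero when all rewards tie
    return [(r - mean_r) / std_r for r in rewards]


# Usage: score a group of sampled answers to one temporal-grounding query.
ground_truth = (12.0, 20.0)
sampled_spans = [(11.5, 19.0), (0.0, 5.0), (12.0, 21.0), (14.0, 18.0)]
rewards = [temporal_iou(span, ground_truth) for span in sampled_spans]
advantages = group_relative_advantages(rewards)
print(advantages)  # responses with above-average IoU receive positive advantage
```

In this setup the policy is updated to favor responses whose rule-based reward exceeds the group average, which is what makes the approach data-efficient for task-specific fine-tuning as the abstract reports.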