VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

April 9, 2025
Authors: Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
cs.AI

Abstract

Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
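To make the "rule-based reward mechanisms" mentioned above concrete, here is a minimal, hypothetical sketch of how a verifiable reward for a temporal grounding sample might be scored: the model's predicted time span is compared with the annotated span by intersection-over-union (IoU), optionally combined with an output-format check. The function names, the exact reward terms, and the 0.9/0.1 weighting are illustrative assumptions, not the paper's specification.

```python
# Hypothetical rule-based reward for temporal grounding (illustrative sketch;
# the actual reward design in VideoChat-R1 may differ).

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rule_based_reward(pred_span: tuple[float, float],
                      gt_span: tuple[float, float],
                      well_formatted: bool) -> float:
    """Combine an accuracy term (span IoU) with a format term; the 0.9/0.1
    weighting is an assumption made for illustration."""
    accuracy_reward = temporal_iou(pred_span, gt_span)
    format_reward = 1.0 if well_formatted else 0.0
    return 0.9 * accuracy_reward + 0.1 * format_reward

# Example: prediction (12.0, 30.0) s vs. ground truth (10.0, 28.0) s -> IoU 0.8, reward 0.82
print(rule_based_reward((12.0, 30.0), (10.0, 28.0), True))
```

Under GRPO, a group of responses sampled for the same video-question pair would each be scored with such a reward, and each response's advantage is computed relative to the group's mean (and standard deviation) rather than estimated by a learned value function, which is why verifiable, rule-based rewards pair naturally with it.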