

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

October 27, 2025
作者: Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng
cs.AI

Abstract

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
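The training recipe above ends with Group Relative Policy Optimization (GRPO), whose core idea is to score each sampled rollout against the other rollouts for the same prompt rather than against a learned value function. A minimal sketch of that group-relative advantage normalization (the function name, epsilon, and binary-correctness rewards are illustrative assumptions, not details from the paper):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its own group (rollouts sampled for
    the same prompt). eps guards against a zero-variance group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one video question, rewarded 1.0 if the
# final answer is correct and 0.0 otherwise.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average get positive advantages and are reinforced; below-average rollouts get negative advantages, so the policy improves relative to its own current behavior without a separate critic model.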