Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
October 27, 2025
Authors: Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng
cs.AI
Abstract
Recent advances in image reasoning methods, particularly "Thinking with
Images", have demonstrated remarkable success in Multimodal Large Language
Models (MLLMs); however, this dynamic reasoning paradigm has not yet been
extended to video reasoning tasks. In this paper, we propose Video-Thinker,
which empowers MLLMs to think with videos by autonomously leveraging their
intrinsic "grounding" and "captioning" capabilities to generate reasoning clues
throughout the inference process. To spark this capability, we construct
Video-Thinker-10K, a curated dataset featuring autonomous tool usage within
chain-of-thought reasoning sequences. Our training strategy begins with
Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group
Relative Policy Optimization (GRPO) to strengthen this reasoning capability.
Through this approach, Video-Thinker enables MLLMs to autonomously navigate
grounding and captioning tasks for video reasoning, eliminating the need for
constructing and calling external tools. Extensive experiments demonstrate that
Video-Thinker achieves significant performance gains on both in-domain tasks
and challenging out-of-domain video reasoning benchmarks, including
Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B
substantially outperforms existing baselines such as Video-R1 and establishes
state-of-the-art performance among 7B-sized MLLMs.
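
To make the two-stage training concrete, below is a minimal, hedged sketch of the group-relative advantage computation that GRPO relies on, paired with a toy format-plus-accuracy reward of the kind commonly used in R1-style video reasoning training. The function names (`grpo_advantages`, `format_and_accuracy_reward`), the reward weighting, and the `<think>`/`<answer>` tags are illustrative assumptions, not the paper's actual implementation.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of its group (all rollouts sampled for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def format_and_accuracy_reward(response, answer):
    """Toy reward (assumed, not from the paper): 0.5 for following a
    <think>...</think><answer>...</answer> format, plus 0.5 if the
    final answer segment contains the reference answer."""
    fmt_ok = "<think>" in response and "<answer>" in response
    ans_ok = fmt_ok and answer.strip() in response.split("<answer>")[-1]
    return 0.5 * fmt_ok + 0.5 * ans_ok

if __name__ == "__main__":
    # One prompt, a group of sampled rollouts (placeholder strings).
    group = [
        "<think>grounded 12s-18s; caption: a man opens a drawer</think><answer>B</answer>",
        "<think>no grounding performed</think><answer>C</answer>",
        "plain text answer B",
        "<think>caption only, no timestamps</think><answer>B</answer>",
    ]
    rewards = [format_and_accuracy_reward(r, "B") for r in group]
    print(rewards)                  # e.g. [1.0, 0.5, 0.0, 1.0]
    print(grpo_advantages(rewards)) # rollouts above the group mean get positive advantage
```

In this scheme, rollouts that both follow the reasoning format (interleaving grounding/captioning clues in the thought) and reach the correct answer receive positive group-relative advantages, which is the signal GRPO uses to strengthen the behavior learned during SFT.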