
FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

September 29, 2025
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
cs.AI

Abstract

While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long-video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle with visually intensive video tasks. To overcome these challenges, in this paper we introduce the concept of thinking with long videos and propose FrameThinker, a novel framework within which LVLMs iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g., selecting frames) and in designing reward functions that guide LVLMs to adopt the newly introduced actions. To address these challenges, we propose a two-phase training strategy: Supervised Fine-Tuning (SFT) first instills fundamental action capabilities, after which Reinforcement Learning (RL) optimizes a strategic decision-making policy. Notably, in the RL phase we conduct an in-depth, comprehensive exploration of the reward design for each action as well as the format reward. Extensive experiments on reasoning benchmarks such as Video-Holmes and LongVideo-Reason, and on long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker, establishes a new state of the art on LongVideo-Reason, reaching 76.1% accuracy with an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.
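To make the multi-turn interrogation loop concrete, the following is a minimal Python sketch of one plausible control flow. The abstract does not specify an interface, so the `model.act` policy method, the `video.frame` accessor, and the turn/frame budgets are all assumptions for illustration; the idea is simply that the model alternates between issuing a frame-selection action, whose results are appended to the context, and committing to a final answer.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "model" emits actions/answers; "observation" carries frames
    content: str

def frame_spotlighting_loop(model, video, question, max_turns=8, frame_budget=32):
    """Drive the multi-turn loop: at each turn the policy either requests
    specific frames (select_frame) or commits to a final answer."""
    history = [Turn("observation", f"Question: {question}")]
    frames_used = 0
    for _ in range(max_turns):
        action, payload = model.act(history)   # hypothetical policy interface
        if action == "answer":
            return payload, frames_used
        if action == "select_frame" and frames_used + len(payload) <= frame_budget:
            frames = [video.frame(i) for i in payload]   # hypothetical accessor
            frames_used += len(frames)
            history.append(Turn("model", f"select_frame({payload})"))
            history.append(Turn("observation", f"<{len(frames)} frames attached>"))
        else:
            break   # malformed action or frame budget exhausted
    return None, frames_used
```

Under this reading, the reported efficiency gain follows directly from the loop structure: the model only ever pays for the frames it explicitly requests, rather than a dense uniform sample of the whole video.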
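The abstract mentions per-action and format rewards for the RL phase without defining them. The sketch below shows one plausible decomposition, not the paper's actual design: the tag-based output format, the frame-relevance heuristic, and the weights `w_fmt`, `w_act`, and `w_acc` are all assumed for illustration.

```python
import re

def format_reward(response: str) -> float:
    """Reward well-formed turns: reasoning wrapped in <think> tags and an
    action or answer in its own tag (tag scheme assumed, not from the paper)."""
    has_think = bool(re.search(r"<think>.*?</think>", response, re.S))
    has_action = bool(re.search(r"<(select_frame|answer)>.*?</\1>", response, re.S))
    return 1.0 if has_think and has_action else 0.0

def action_reward(selected: list[int], relevant: list[tuple[int, int]]) -> float:
    """Per-action reward: fraction of selected frame indices that fall inside
    annotated question-relevant segments (one plausible design)."""
    if not selected:
        return 0.0
    hits = sum(1 for i in selected if any(lo <= i <= hi for lo, hi in relevant))
    return hits / len(selected)

def total_reward(response, selected, relevant, correct,
                 w_fmt=0.2, w_act=0.3, w_acc=0.5) -> float:
    """Weighted sum of format, action, and answer-accuracy terms (weights assumed)."""
    return (w_fmt * format_reward(response)
            + w_act * action_reward(selected, relevant)
            + w_acc * (1.0 if correct else 0.0))
```

A composite of this shape would let the RL stage shape both how the model formats its turns and which frames it chooses, on top of the usual answer-correctness signal.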