FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
September 29, 2025
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
cs.AI
Abstract
While Large Vision-Language Models (LVLMs) have achieved substantial progress
in video understanding, their application to long video reasoning is hindered
by uniform frame sampling and static textual reasoning, which are inefficient
and struggle to handle visually intensive video tasks. To overcome these
challenges, we introduce in this paper the concept of thinking with long
videos and propose a novel framework, FrameThinker, within which LVLMs
iteratively interrogate video content. However, developing such video
reasoning capabilities in LVLMs presents notable challenges, particularly in
adapting the model to new video actions (e.g., selecting frames) and in
designing reward functions that guide LVLMs to adopt the newly introduced
actions. To solve
these challenges, we propose a two-phase training strategy, first employing
Supervised Fine-Tuning (SFT) to instill fundamental action capabilities,
followed by Reinforcement Learning (RL) to optimize a strategic decision-making
policy. Notably, in this RL phase, we conduct an in-depth and comprehensive
exploration of the reward design for each action, as well as of the format
reward. Extensive
experiments on reasoning benchmarks such as Video-Holmes and
LongVideo-Reason, and on long-video understanding benchmarks such as
LongVideoBench, MLVU, VideoMME, and LVBench demonstrate that FrameThinker
achieves a significant average
improvement of +10.4% over baselines while drastically reducing the number of
processed frames. Most notably, our 7B model, FrameThinker, establishes a new
state-of-the-art on LongVideo-Reason, achieving 76.1% accuracy using an average
of only 20.6 frames. This not only outperforms the competitive LongVILA-R1
(72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating
unparalleled efficiency and effectiveness.
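
To make the framework concrete, here is a minimal Python sketch of the multi-turn frame-spotlighting loop the abstract describes: the LVLM alternates between textual reasoning and a frame-selection action, fetching new frames only on demand rather than sampling uniformly up front. The interfaces (`model.generate`, `video.get_frame`), the `<select_frame>`/`<answer>` tag format, and the turn budget are all illustrative assumptions, not the authors' actual API or action syntax.

```python
import re

def parse_action(response: str):
    """Extract an (action, payload) pair from the model's text output.
    The tag names are assumed for illustration, not the paper's format."""
    m = re.search(r"<select_frame>(\d+)</select_frame>", response)
    if m:
        return "select_frame", int(m.group(1))
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m:
        return "answer", m.group(1).strip()
    return "continue", None  # plain reasoning turn, no action emitted

def frame_spotlight_loop(model, video, question: str, max_turns: int = 8):
    """Multi-turn loop: the LVLM reasons, optionally requests a frame,
    observes it, and repeats until it commits to an answer."""
    dialogue = [{"role": "user", "content": question}]
    frames_seen = 0
    for _ in range(max_turns):
        response = model.generate(dialogue)      # one textual reasoning turn
        dialogue.append({"role": "assistant", "content": response})
        action, payload = parse_action(response)
        if action == "select_frame":
            frame = video.get_frame(payload)     # fetch only the requested frame
            frames_seen += 1
            dialogue.append({"role": "observation", "content": frame})
        elif action == "answer":
            return payload, frames_seen          # answer plus frames actually used
    return None, frames_seen                     # turn budget exhausted
```

Under a scheme like this, the number of frames processed grows with the number of frame-selection turns rather than with video length, which is consistent with the paper's reported average of 20.6 frames versus 512 for uniform sampling.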
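
The abstract also notes that the RL phase explores a reward for each action alongside a format reward, without specifying their form. The sketch below shows one plausible way such terms could combine into a single trajectory reward; the outcome term, the frame-budget penalty, and all weights are placeholder assumptions, not the paper's tuned design.

```python
def trajectory_reward(answer_correct: bool,
                      actions: list[str],
                      well_formatted: bool) -> float:
    """Hypothetical composition of outcome, per-action, and format rewards."""
    reward = 1.0 if answer_correct else 0.0       # outcome reward for the final answer
    frame_selections = sum(1 for a in actions if a == "select_frame")
    reward -= 0.05 * frame_selections             # per-action term: discourage redundant frame requests
    reward += 0.1 if well_formatted else 0.0      # format reward for well-formed action tags
    return reward
```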