FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

Abstract

While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long-video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle with visually intensive video tasks. To overcome these challenges, we introduce the concept of thinking with long videos and propose a novel framework, FrameThinker. Within this framework, LVLMs can iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g., frame selection) and in designing reward functions that guide LVLMs to adopt the newly introduced actions. To address these challenges, we propose a two-phase training strategy: we first employ Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, then apply Reinforcement Learning (RL) to optimize a strategic decision-making policy. In the RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and of the format reward. Extensive experiments on reasoning benchmarks such as Video-Holmes and LongVideo-Reason, and on long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker, establishes a new state of the art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (20.6 vs. 512), demonstrating unparalleled efficiency and effectiveness.
