ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Abstract

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet their ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to provide query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach does not require explicit frame-level supervision and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.
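The autoregressive selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the per-frame scores stand in for the output of a learned policy network, conditioning on earlier picks is reduced to masking out frames already chosen, and the rewards that would come from a reference LMM are passed in as plain numbers. The names `sample_frames` and `advantages` are hypothetical and chosen only for this illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_frames(scores, k, rng):
    """Pick k distinct frame indices one at a time (assumed toy policy).

    Each step renormalizes over the frames not yet chosen, so every pick
    is conditioned on the earlier ones. Returns the chosen indices in
    temporal order and the log-probability of the sequence of picks.
    """
    remaining = list(range(len(scores)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        probs = softmax([scores[i] for i in remaining])
        pos = rng.choices(range(len(remaining)), weights=probs)[0]
        log_prob += math.log(probs[pos])
        chosen.append(remaining.pop(pos))
    return sorted(chosen), log_prob


def advantages(rewards):
    """Center rewards on their mean, a common REINFORCE-style baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.2, 1.5, 0.1, 0.9, 1.1, 0.3, 0.0, 0.7]  # placeholder policy scores
    candidates = [sample_frames(scores, 3, rng) for _ in range(4)]
    rewards = [0.1, 0.8, 0.4, 0.3]  # placeholder for reference-LMM rewards
    for (frames, log_prob), adv in zip(candidates, advantages(rewards)):
        # In a policy-gradient update, adv would weight the gradient of log_prob.
        print(frames, round(log_prob, 3), round(adv, 3))
```

Factorizing the choice into k sequential picks means the policy only ever scores the remaining frames at each step, rather than enumerating every k-subset of the video, which is one way to read the abstract's point about reducing the complexity of the combinatorial frame space.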
