
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

June 2, 2025
Authors: Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro
cs.AI

Abstract

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to provide query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM that reflect the model's intrinsic preference for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach requires no explicit supervision at the frame level and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.
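
To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the two ideas the abstract names: an autoregressive policy that selects frames one at a time, conditioned on the query and on the frames already chosen, trained with a REINFORCE-style update from a scalar reward supplied by a reference model. The module names, feature shapes, GRU-based conditioning, and the reference_lmm_reward stub are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class FrameSelectionPolicy(nn.Module):
    """Autoregressively picks k frames out of T candidates,
    conditioning each pick on the query and on prior picks."""
    def __init__(self, feat_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden)       # carries selection state
        self.scorer = nn.Linear(hidden + feat_dim, 1)  # scores each candidate

    def forward(self, frame_feats, query_feat, k):
        # frame_feats: (T, D) per-frame features; query_feat: (D,)
        T = frame_feats.size(0)
        h = query_feat.unsqueeze(0)                    # init state from the query
        mask = torch.zeros(T, dtype=torch.bool)        # tracks already-picked frames
        picked, log_probs = [], []
        for _ in range(k):
            # Score every still-available frame given the current state.
            hs = h.expand(T, -1)
            logits = self.scorer(torch.cat([hs, frame_feats], dim=-1)).squeeze(-1)
            logits = logits.masked_fill(mask, float("-inf"))
            dist = torch.distributions.Categorical(logits=logits)
            idx = dist.sample()
            log_probs.append(dist.log_prob(idx))
            mask[idx] = True
            picked.append(int(idx))
            # Condition the next choice on the frame just selected.
            h = self.rnn(frame_feats[idx].unsqueeze(0), h)
        return picked, torch.stack(log_probs)

def reference_lmm_reward(frame_indices, question):
    """Stub: in the paper, the reward reflects a reference LMM's
    preference for frames that support a grounded answer; random here."""
    return torch.rand(())

# One REINFORCE-style update step; note no frame-level labels are needed.
policy = FrameSelectionPolicy()
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)
frame_feats = torch.randn(64, 256)                     # e.g. 64 candidate frames
query_feat = torch.randn(256)
picked, logp = policy(frame_feats, query_feat, k=8)
reward = reference_lmm_reward(picked, "What happens after the goal is scored?")
loss = -(reward * logp.sum())                          # maximize expected reward
optim.zero_grad(); loss.backward(); optim.step()

The autoregressive factorization is what keeps the search tractable: choosing 8 of 64 frames jointly means scoring roughly 4.4 billion subsets, whereas conditioning each pick on the previous ones reduces this to 8 sequential choices over at most 64 candidates each.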