ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Abstract

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet their ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, and these may fail to provide query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames that best support temporally grounded responses. To explore the large combinatorial frame space efficiently, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach requires no explicit supervision at the frame level and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.
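The selection scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the per-frame `logits` stand in for a learned policy network, `toy_reward` is a hypothetical placeholder for the reward derived from the reference LMM, and the mean-baseline REINFORCE-style loss is one common policy-gradient choice rather than the specific objective used in the paper. Conditioning on earlier picks is reduced here to masking frames that were already chosen.

```python
import math
import random


def sample_frames(logits, k, rng):
    """Sample k distinct frame indices one at a time (autoregressively).

    Each step renormalises a softmax over the frames not yet chosen, so the
    choice at step t depends on the earlier picks through the mask.
    Returns (indices in temporal order, log-probability of the pick sequence).
    """
    remaining = list(range(len(logits)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        peak = max(logits[i] for i in remaining)  # subtract max for stability
        weights = [math.exp(logits[i] - peak) for i in remaining]
        total = sum(weights)
        pos = rng.choices(range(len(remaining)), weights=weights)[0]
        log_prob += math.log(weights[pos] / total)
        chosen.append(remaining.pop(pos))
    return sorted(chosen), log_prob


def advantages(rewards):
    """Centre rewards on their mean, a simple policy-gradient baseline."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def toy_reward(frames, relevant):
    """Hypothetical stand-in for the reference LMM: share of relevant picks."""
    return sum(f in relevant for f in frames) / len(frames)


rng = random.Random(0)
logits = [0.0] * 16      # assumed: a uniform policy over 16 candidate frames
relevant = {3, 4, 5}     # assumed: the frames that answer the query
samples = [sample_frames(logits, 4, rng) for _ in range(8)]
rewards = [toy_reward(frames, relevant) for frames, _ in samples]
adv = advantages(rewards)
# REINFORCE-style surrogate loss: raise the log-probability of frame subsets
# whose reward is above the batch mean, lower it for those below.
loss = -sum(a * lp for a, (_, lp) in zip(adv, samples))
```

Sampling frames one at a time means each step scores at most N candidates instead of enumerating all N-choose-k subsets, which is the kind of complexity reduction the abstract refers to. In a real system the logits would be recomputed at every step by a model conditioned on the video, the query, and the frames already selected.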