Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
December 16, 2025
Authors: Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
cs.AI
Abstract
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limitations of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding predictions and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's difficulty in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
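To make the token-selective credit assignment idea concrete, the sketch below shows one plausible way it could sit on top of GRPO-style group-relative advantages: the grounding reward's advantage is credited only to the tokens that emit the predicted time span, and the answer (zoom-in accuracy) reward's advantage only to the answer tokens. The function names, the mask-based token partition, and the two-reward interface are illustrative assumptions for this abstract, not the authors' released implementation.

```python
import torch


def group_relative_adv(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize each rollout's reward within its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def token_selective_advantages(
    grounding_rewards: torch.Tensor,  # (G,) temporal-grounding reward per rollout
    answer_rewards: torch.Tensor,     # (G,) answer / zoom-in accuracy reward per rollout
    grounding_mask: torch.Tensor,     # (G, T) 1.0 where a token emits the predicted time span
    answer_mask: torch.Tensor,        # (G, T) 1.0 where a token emits the final answer
) -> torch.Tensor:
    """Credit each reward only to the tokens responsible for it (hypothetical interface).

    Vanilla GRPO broadcasts one scalar advantage over every generated token of a
    rollout; here grounding tokens receive the grounding advantage, answer tokens
    receive the answer advantage, and tokens in neither span get zero credit.
    """
    g_adv = group_relative_adv(grounding_rewards)  # (G,)
    a_adv = group_relative_adv(answer_rewards)     # (G,)
    per_token = g_adv[:, None] * grounding_mask + a_adv[:, None] * answer_mask
    return per_token  # (G, T), used in place of a single scalar advantage per rollout
```

Under this reading, the resulting per-token advantage matrix would replace the single broadcast advantage in the clipped policy-gradient loss, so a poor temporal prediction no longer penalizes answer tokens that happened to be correct, and vice versa.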