Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
December 16, 2025
Authors: Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
cs.AI
Abstract
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limitations of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding predictions and facilitates fine-grained visual verification on the grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's difficulty in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
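Since the abstract only names these mechanisms, the snippet below is a minimal PyTorch sketch of how token-selective credit assignment could plug into a GRPO-style clipped objective: each reward's group-relative advantage is routed only to the response tokens responsible for it (timestamp tokens for the grounding reward, answer tokens for the answer/zoom-in accuracy reward). This is an illustrative assumption, not the paper's released implementation; all names (`token_selective_grpo_loss`, `grounding_mask`, `answer_mask`) are hypothetical.

```python
import torch

def group_relative_advantage(rewards, eps=1e-8):
    """Normalize one reward across the G rollouts sampled for the same prompt (GRPO-style)."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_selective_grpo_loss(logprobs, old_logprobs, grounding_mask, answer_mask,
                              adv_grounding, adv_answer, clip_eps=0.2):
    """Clipped policy-gradient loss for one rollout, with per-reward credit assignment.

    logprobs, old_logprobs      : (T,) per-token log-probs under the current / behavior policy
    grounding_mask, answer_mask : (T,) boolean masks marking timestamp vs. answer tokens
    adv_grounding, adv_answer   : scalar group-relative advantages for the two rewards
    """
    ratio = torch.exp(logprobs - old_logprobs)  # per-token importance ratio
    # Route each advantage only to its responsible tokens; uncredited tokens get zero signal.
    per_token_adv = adv_grounding * grounding_mask.float() + adv_answer * answer_mask.float()
    unclipped = ratio * per_token_adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * per_token_adv
    # Pessimistic (PPO-style) objective, averaged over credited tokens only.
    credited = (grounding_mask | answer_mask).float()
    return -(torch.min(unclipped, clipped) * credited).sum() / credited.sum().clamp(min=1.0)
```

Under these assumptions, a coarse-to-fine training loop would score each rollout's grounding quality (e.g. temporal IoU) and its answer/zoom-in accuracy separately, normalize each score across the rollout group with `group_relative_advantage`, and let the two masks decide which tokens each reward signal updates.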