VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
March 23, 2026
Authors: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
cs.AI
Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and the varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into multiple segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate the relevance of observed segments to the query and propagate these scores to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments, enabling accurate answers from only sparse observations. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
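The two core ideas in the abstract, a visual-temporal affinity graph and propagation of relevance from observed to unseen segments, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the blending weight `alpha`, temporal bandwidth `sigma`, and the simple clamped label-propagation loop are all illustrative assumptions standing in for the paper's Hypothesis-Verification-Refinement procedure.

```python
import numpy as np

def affinity_graph(features, timestamps, alpha=0.5, sigma=30.0):
    """Combine cosine visual similarity with a Gaussian temporal-proximity
    kernel into one segment-to-segment affinity matrix.
    alpha and sigma are illustrative hyperparameters, not from the paper."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    visual = f @ f.T                                   # cosine similarity
    t = np.asarray(timestamps, dtype=float)
    temporal = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * sigma**2))
    return alpha * visual + (1 - alpha) * temporal

def propagate(W, scores, observed, n_iters=10, keep=0.8):
    """Spread relevance scores of observed segments to unseen ones by
    iterating over the row-normalized affinity graph, clamping the
    observed segments to their measured scores each step."""
    P = W / W.sum(axis=1, keepdims=True)               # transition matrix
    s = np.zeros(len(W))
    s[observed] = scores
    for _ in range(n_iters):
        s = keep * (P @ s) + (1 - keep) * s            # diffuse relevance
        s[observed] = scores                           # re-clamp evidence
    return s                                           # global distribution
```

Unseen segments that are visually or temporally close to a highly relevant observed segment inherit a high score, which is what lets the method pick the next segments to examine without watching the whole video.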