
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

March 23, 2026
Authors: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
cs.AI

Abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
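The abstract describes two core components: a visual-temporal affinity graph over video segments, and propagation of query-relevance scores from observed segments to unseen ones. The sketch below illustrates that idea in a minimal form. All names (`affinity_graph`, `propagate`), the mixing weight `alpha`, the temporal decay `tau`, and the weighted-average propagation rule are illustrative assumptions, not the authors' released implementation; the paper's actual Hypothesis-Verification-Refinement loop and scoring model are available at the linked code site.

```python
import numpy as np

def affinity_graph(seg_embs: np.ndarray, tau: float = 2.0,
                   alpha: float = 0.5) -> np.ndarray:
    """Combine visual similarity and temporal proximity into one affinity matrix.

    seg_embs: (n_segments, dim) visual embeddings, one per segment, in time order.
    """
    normed = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    visual = normed @ normed.T                                   # cosine similarity
    idx = np.arange(len(seg_embs))
    temporal = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)  # decays with distance
    return alpha * visual + (1 - alpha) * temporal

def propagate(affinity: np.ndarray, observed: dict[int, float]) -> np.ndarray:
    """Spread relevance scores of observed segments to unseen segments.

    observed maps segment index -> query-relevance score; unseen segments
    receive an affinity-weighted average of the observed scores.
    """
    scores = np.zeros(len(affinity))
    weights = np.zeros(len(affinity))
    for i, s in observed.items():
        scores += affinity[:, i] * s
        weights += affinity[:, i]
    est = scores / np.maximum(weights, 1e-8)
    for i, s in observed.items():
        est[i] = s          # keep observed segments at their measured scores
    return est

# Toy example: 5 segments; segment 0 looks relevant, segment 4 does not.
embs = np.array([[1.0, 0.0], [1.0, 0.1], [0.5, 0.5], [0.1, 1.0], [0.0, 1.0]])
A = affinity_graph(embs)
dist = propagate(A, {0: 1.0, 4: 0.0})   # global relevance distribution
```

Under this sketch, segments close to the observed relevant segment (visually and temporally) inherit higher scores, giving a global distribution that can guide which segments to sample next under a sparse observation budget.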