VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) because their limited context windows make it necessary to identify the sparse video segments relevant to a query. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and the varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then run a Hypothesis-Verification-Refinement loop that estimates the relevance scores of observed segments to the query and propagates them to unseen segments. This yields a global relevance distribution that guides the localization of the most critical segments for final answering under sparse observation. Experiments show that our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
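
To make the graph-based propagation idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give formulas, so everything here is an assumption chosen for illustration: the affinity is taken as clipped cosine similarity multiplied by an exponential temporal decay with a hypothetical scale `tau`, and propagation is a generic clamped label-propagation scheme. The function names, toy features, and observed scores are likewise hypothetical, and the Hypothesis-Verification-Refinement loop itself (querying an MLLM) is not modelled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_affinity(features, tau=2.0):
    """Affinity W[i][j] = max(cos, 0) * exp(-|i - j| / tau), zero diagonal.

    Both the product form and `tau` are assumptions made for illustration.
    """
    n = len(features)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                visual = max(cosine(features[i], features[j]), 0.0)
                temporal = math.exp(-abs(i - j) / tau)
                W[i][j] = visual * temporal
    return W

def propagate(W, observed, iters=50):
    """Spread scores of observed segments to unobserved ones.

    Observed scores stay clamped; each unobserved score is repeatedly
    replaced by the affinity-weighted mean of its neighbours' scores.
    """
    n = len(W)
    scores = [observed.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        updated = list(scores)
        for i in range(n):
            if i in observed:
                continue
            total = sum(W[i])
            if total > 0:
                updated[i] = sum(W[i][j] * scores[j] for j in range(n)) / total
        scores = updated
    return scores

# Toy video: six segments with 2-D features; segments 0-2 and 3-5 look alike.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
            [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
W = build_affinity(features)
# Hypothetical relevance scores for two segments that were already inspected.
observed = {0: 0.1, 4: 0.9}
scores = propagate(W, observed)
# Rank unobserved segments by propagated relevance to pick the next to inspect.
candidates = sorted((i for i in range(len(scores)) if i not in observed),
                    key=lambda i: scores[i], reverse=True)
print(candidates[0], [round(s, 2) for s in scores])
```

In this toy setup, the unobserved segments that resemble and sit near the high-scoring segment 4 receive higher propagated scores than those near the low-scoring segment 0, which is the kind of global relevance distribution the abstract describes using to decide where to look next.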
