VideoDetective: 장기 영상 이해를 위한 외부 질의와 내적 관련성을 통한 단서 추적

초록

긴 영상 이해는 제한된 컨텍스트 윈도우로 인해 멀티모달 대규모 언어 모델(MLLM)에게 여전히 어려운 과제이며, 이로 인해 질의와 관련된 희소한 영상 세그먼트를 식별해야 합니다. 그러나 기존 방법들은 주로 질의만을 기준으로 단서를 지역화하여 영상의 내재적 구조와 세그먼트 간 다양한 관련성을 간과해 왔습니다. 이를 해결하기 위해 우리는 긴 영상 질의 응답에서 효과적인 단서 탐색을 위해 질의-세그먼트 관련성과 세그먼트 간 친화도를 통합하는 VideoDetective 프레임워크를 제안합니다. 구체적으로, 우리는 영상을 다양한 세그먼트로 분할하고 시각적 유사성과 시간적 근접성을 기반으로 구축된 시각-시간적 친화도 그래프로 표현합니다. 그런 다음 가설-검증-정교화 루프를 수행하여 관찰된 세그먼트들의 질의 대비 관련성 점수를 추정하고 이를 관찰되지 않은 세그먼트들로 전파하여, 희소 관찰만으로 최종 응답에 가장 중요한 세그먼트의 지역화를 안내하는 전역 관련성 분포를 생성합니다. 실험 결과, 우리 방법은 대표적인 벤치마크에서 다양한 주류 MLLM들에 걸쳐 일관되게 상당한 성능 향상을 달성했으며, VideoMME-long에서 최대 7.5%의 정확도 향상을 보였습니다. 우리의 코드는 https://videodetective.github.io/에서 확인할 수 있습니다.

English

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/

VideoDetective: 장기 영상 이해를 위한 외부 질의와 내적 관련성을 통한 단서 추적

VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

초록

Support