VideoDeepResearch: Long Video Understanding With Agentic Tool Using
June 12, 2025
Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou
cs.AI
Abstract
Long video understanding (LVU) presents a significant challenge for current
multi-modal large language models (MLLMs) due to the task's inherent complexity
and context window constraints. It is widely assumed that addressing LVU tasks
requires foundation MLLMs with extended context windows, strong visual
perception capabilities, and proficient domain expertise. In this work, we
challenge this common belief by introducing VideoDeepResearch, a novel agentic
framework for long video understanding. Our approach relies solely on a
text-only large reasoning model (LRM) combined with a modular multi-modal
toolkit, including multimodal retrievers and visual perceivers, all of which
are readily available in practice. For each LVU task, the system formulates a
problem-solving strategy through reasoning, while selectively accessing and
utilizing essential video content via tool use. We conduct extensive
experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench.
Our results demonstrate that VideoDeepResearch achieves substantial
improvements over existing MLLM baselines, surpassing the previous
state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and
LongVideoBench, respectively. These findings highlight the promise of agentic
systems in overcoming key challenges in LVU problems.
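To make the described architecture concrete, below is a minimal, hypothetical sketch of the agentic loop the abstract outlines: a text-only reasoning model plans, then selectively calls multimodal tools (a clip retriever and a visual perceiver) over the video until it can answer. This is not the authors' implementation; the paper's interfaces are not given here, so all names and signatures (`retrieve_clips`, `perceive_clip`, `reason_step`, the `SEARCH:`/`ANSWER:` action format) are illustrative assumptions, and the tools are stubbed.

```python
# Illustrative sketch only, NOT the VideoDeepResearch implementation.
# A text-only LRM (reason_step) drives the loop; multimodal tools are stubs.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    start_s: float  # clip start time, in seconds
    end_s: float    # clip end time, in seconds


def retrieve_clips(video_path: str, query: str, top_k: int = 4) -> List[Clip]:
    """Multimodal retriever (hypothetical): return clips relevant to the query.
    Stubbed; a real system would query a video-text embedding index."""
    return [Clip(60.0 * i, 60.0 * i + 30.0) for i in range(top_k)]


def perceive_clip(video_path: str, clip: Clip, question: str) -> str:
    """Visual perceiver (hypothetical): describe a clip w.r.t. the question.
    Stubbed; a real system would call a short-context VLM on the clip."""
    return f"[description of {video_path} {clip.start_s:.0f}-{clip.end_s:.0f}s]"


def answer_lvu_question(
    video_path: str,
    question: str,
    reason_step: Callable[[str], str],  # text-only LRM: transcript -> action
    max_rounds: int = 5,
) -> str:
    """Iteratively plan, retrieve, and perceive until the LRM emits an answer."""
    transcript = f"Question about {video_path}: {question}\n"
    for _ in range(max_rounds):
        action = reason_step(transcript)  # e.g. "SEARCH: ..." or "ANSWER: ..."
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        if action.startswith("SEARCH:"):
            query = action[len("SEARCH:"):].strip()
            for clip in retrieve_clips(video_path, query):
                transcript += perceive_clip(video_path, clip, question) + "\n"
        transcript += f"Action taken: {action}\n"
    return "Unable to answer within the round budget."


# Toy reasoning policy for demonstration: search once, then answer.
def toy_reason_step(transcript: str) -> str:
    if "description" not in transcript:
        return "SEARCH: scenes relevant to the question"
    return "ANSWER: (answer synthesized from retrieved clip descriptions)"


print(answer_lvu_question("movie.mp4", "Why did the protagonist leave?",
                          toy_reason_step))
```

The design point this sketch tries to capture is the abstract's central claim: the orchestrating model never ingests the full video, only text descriptions of the few clips the retriever surfaces, which is how a short-context, text-only LRM can sidestep the context window constraints of long video input.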