VideoDeepResearch: Long Video Understanding With Agentic Tool Using

June 12, 2025
Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou
cs.AI

Abstract

Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraints. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state of the art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming the key challenges of LVU.
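To make the framework concrete, here is a minimal sketch of the reason-retrieve-perceive loop the abstract describes, assuming a text-only LRM that plans in text and two modular tools (a clip retriever and a visual perceiver). Every name below (VideoRetriever, VisualPerceiver, answer_lvu_question, the SEARCH/ANSWER protocol) is a hypothetical illustration for exposition, not the paper's actual interface.

```python
# Hypothetical sketch of a VideoDeepResearch-style agentic loop.
# These classes are illustrative stand-ins for the paper's
# "multimodal toolkit" (retriever + perceiver) and text-only LRM.

from dataclasses import dataclass


@dataclass
class Clip:
    start: float  # clip boundaries in seconds
    end: float


class VideoRetriever:
    """Stand-in multimodal retriever: maps a text query to candidate clips."""

    def search(self, video_path: str, query: str, k: int = 3) -> list[Clip]:
        # A real retriever would rank clips by cross-modal similarity;
        # this stub just returns evenly spaced placeholder clips.
        return [Clip(start=60.0 * i, end=60.0 * i + 30.0) for i in range(k)]


class VisualPerceiver:
    """Stand-in visual perceiver: renders a short clip as text for the LRM."""

    def describe(self, video_path: str, clip: Clip) -> str:
        return f"[caption of {video_path}, {clip.start:.0f}s-{clip.end:.0f}s]"


def text_only_lrm(prompt: str) -> str:
    """Placeholder for the text-only large reasoning model."""
    # In practice this would call a reasoning LLM; the stub always answers.
    return "ANSWER: (placeholder)"


def answer_lvu_question(video_path: str, question: str, max_rounds: int = 5) -> str:
    """Iterate reason -> retrieve -> perceive; the LRM never sees raw frames.

    It reads textual clip descriptions, decides what to inspect next,
    and stops once it judges the gathered evidence sufficient.
    """
    retriever, perceiver = VideoRetriever(), VisualPerceiver()
    evidence: list[str] = []
    for _ in range(max_rounds):
        plan = text_only_lrm(
            f"Question: {question}\n"
            "Evidence so far:\n" + "\n".join(evidence) + "\n"
            "Reply with 'SEARCH: <query>' or 'ANSWER: <final answer>'."
        )
        if plan.startswith("ANSWER:"):
            return plan.removeprefix("ANSWER:").strip()
        query = plan.removeprefix("SEARCH:").strip()
        for clip in retriever.search(video_path, query):
            evidence.append(perceiver.describe(video_path, clip))
    return "unable to answer within the round budget"


if __name__ == "__main__":
    print(answer_lvu_question("movie.mp4", "Why does the detective return to the pier?"))
```

The design point this sketch captures is the one the abstract emphasizes: long-video capability comes from selective, tool-mediated access to video content rather than from feeding the full video into a long-context MLLM.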