Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Abstract

Long-form video understanding is challenging because of the extensive spatio-temporal complexity of such videos and the difficulty of answering questions over such long contexts. Although Large Language Models (LLMs) have made considerable progress in video analysis and long-context handling, they still show limitations when processing information-dense, hour-long videos. To overcome these limitations, we propose the Deep Video Discovery (DVD) agent, which applies an agentic search strategy over segmented video clips. Unlike previous video agents that rely on a manually designed, rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, the DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform a comprehensive evaluation on multiple long video understanding benchmarks, which demonstrates the advantage of the entire system design. Our DVD agent achieves state-of-the-art (SOTA) performance on the challenging LVBench dataset, surpassing prior work by a large margin. We also provide comprehensive ablation studies and in-depth tool analyses, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.
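
The plan, select-tool, act, refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Clip` record, the two tools (`search_clips`, `inspect_clip`), and the `scripted_planner` are hypothetical stand-ins, and the planner is a fixed script playing the role that an LLM would play in the actual system.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start_s: int
    end_s: int
    caption: str  # hypothetical stand-in for a stored per-clip description


def search_clips(db, query):
    """Hypothetical coarse tool: indices of clips whose caption contains every query word."""
    words = query.lower().split()
    return [i for i, clip in enumerate(db)
            if all(w in clip.caption.lower() for w in words)]


def inspect_clip(db, index):
    """Hypothetical fine-grained tool: return the details of one clip."""
    clip = db[index]
    return f"{clip.start_s}-{clip.end_s}s: {clip.caption}"


TOOLS = {"search_clips": search_clips, "inspect_clip": inspect_clip}


def run_agent(question, db, planner, max_steps=5):
    """Let the planner pick a tool and its argument each step until it answers."""
    history = []  # accumulated (tool, argument, result) observations
    for _ in range(max_steps):
        action, arg = planner(question, history)
        if action == "answer":
            return arg, history
        result = TOOLS[action](db, arg)
        history.append((action, arg, result))
    return None, history  # step budget exhausted without an answer


def scripted_planner(question, history):
    """Assumption: a fixed script standing in for LLM reasoning over the history."""
    if not history:
        return "search_clips", "red car"
    action, _, result = history[-1]
    if action == "search_clips":
        return ("inspect_clip", result[0]) if result else ("answer", "not found")
    return "answer", result


db = [
    Clip(0, 60, "A man walks a dog in the park"),
    Clip(60, 120, "A red car stops at the crossing"),
    Clip(120, 180, "People enter a cafe"),
]
answer, history = run_agent("When does the red car appear?", db, scripted_planner)
print(answer)
```

The point of the sketch is the control flow: the sequence of tool calls is not fixed in `run_agent` but chosen step by step by the planner from what it has observed so far, which is the contrast with a rigid, manually designed workflow.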

