Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
May 23, 2025
Authors: Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
cs.AI
Abstract
Long-form video understanding presents significant challenges due to
extensive temporal-spatial complexity and the difficulty of question answering
under such extended contexts. While Large Language Models (LLMs) have
demonstrated considerable advancements in video analysis capabilities and
long-context handling, they continue to exhibit limitations when processing
information-dense, hour-long videos. To overcome these limitations, we propose
the Deep Video Discovery (DVD) agent, which applies an agentic search strategy
over segmented video clips. Unlike previous video agents that rely on manually
designed, rigid workflows, our approach emphasizes the autonomous nature of
agents. Given a set of search-centric tools over a multi-granular video
database, our DVD agent leverages the advanced reasoning capability of the LLM
to plan based on its current observation state, strategically select tools,
formulate appropriate parameters for actions, and iteratively refine its
internal reasoning in light of the gathered information. We perform
comprehensive evaluations on multiple long video understanding benchmarks that
demonstrate the advantage of the entire system design. Our DVD agent achieves
SOTA performance, surpassing prior work by a large margin on the challenging
LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are
also provided, yielding insights to further advance intelligent agents tailored
for long-form video understanding tasks. The code will be released later.
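
The loop the abstract describes (observe, plan, pick a tool, set its parameters, refine) can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names `search_clips` and `inspect_clip`, the toy caption data, and the rule-based `stub_policy` standing in for the LLM planner are all hypothetical, chosen only to show the control flow of an agentic search over segmented clips at two granularities.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: int      # clip start time in seconds
    end: int        # clip end time in seconds
    caption: str    # made-up caption standing in for real clip content


# Toy "video database" of segmented clips (all data is invented).
CLIPS = [
    Clip(0, 60, "a man enters the kitchen"),
    Clip(60, 120, "he chops onions on a wooden board"),
    Clip(120, 180, "a red pot boils on the stove"),
]


def search_clips(query: str, top_k: int = 2) -> list[int]:
    """Coarse tool (hypothetical): rank clips by word overlap with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(c.caption.split())), i) for i, c in enumerate(CLIPS)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:top_k] if score > 0]


def inspect_clip(index: int) -> str:
    """Fine tool (hypothetical): return a detailed view of a single clip."""
    c = CLIPS[index]
    return f"[{c.start}-{c.end}s] {c.caption}"


TOOLS = {"search_clips": search_clips, "inspect_clip": inspect_clip}


def stub_policy(question, history):
    """Stand-in for the LLM planner: maps the observation history to an action.

    Returns (tool_name, kwargs), or ("answer", {"text": ...}) to stop.
    """
    if not history:
        return "search_clips", {"query": question}
    last_tool, last_obs = history[-1]
    if last_tool == "search_clips":
        if not last_obs:
            return "answer", {"text": "not found"}
        return "inspect_clip", {"index": last_obs[0]}
    return "answer", {"text": last_obs}


def run_agent(question, policy=stub_policy, max_steps=5):
    """Iterate: plan from current observations, call a tool, record the result."""
    history = []
    for _ in range(max_steps):
        action, args = policy(question, history)
        if action == "answer":
            return args["text"]
        observation = TOOLS[action](**args)
        history.append((action, observation))
    return "no answer within step budget"


print(run_agent("what color is the pot on the stove"))
```

In a real system the policy would be an LLM call that sees the question and the accumulated observations; the point of the sketch is only that the workflow is not hard-coded, since the policy decides at each step which tool to call and with which arguments.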