Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Abstract

Long-form video understanding is challenging because of the extensive spatio-temporal complexity of long videos and the difficulty of answering questions over such extended contexts. Although Large Language Models (LLMs) have made considerable advances in video analysis and long-context handling, they remain limited when processing information-dense, hour-long videos. To overcome these limitations, we propose the Deep Video Discovery (DVD) agent, which applies an agentic search strategy over segmented video clips. Unlike previous video agents that rely on manually designed, rigid workflows, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLMs to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform a comprehensive evaluation on multiple long-video understanding benchmarks, which demonstrates the advantage of the overall system design. Our DVD agent achieves state-of-the-art performance, surpassing prior work by a large margin on the challenging LVBench dataset. We also provide comprehensive ablation studies and in-depth tool analyses, yielding insights to further advance intelligent agents tailored to long-form video understanding tasks. The code will be released later.
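
To make the observe, plan, act, refine loop described above more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything named here is an assumption made for illustration: the three tools (`global_summary`, `search_clips`, `inspect_clip`), the `Clip` and `VideoDatabase` types, the keyword-overlap search, and `scripted_policy`, a hand-written stand-in for the LLM that would normally choose each tool and its parameters.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Clip:
    index: int
    start_s: float
    end_s: float
    caption: str  # hypothetical per-clip text description


@dataclass
class VideoDatabase:
    """Toy stand-in for a multi-granular database over segmented clips."""
    clips: list[Clip]

    def global_summary(self) -> str:
        # Coarsest granularity: one description of the whole video.
        return " ".join(c.caption for c in self.clips)

    def search_clips(self, query: str, top_k: int = 2) -> list[int]:
        # Middle granularity: rank clips by word overlap with the query.
        # (A real system would likely use embeddings; this is an assumption.)
        words = set(query.lower().split())
        scored = [(len(words & set(c.caption.lower().split())), c.index)
                  for c in self.clips]
        ranked = sorted((s, i) for s, i in scored if s > 0)
        ranked.sort(key=lambda si: (-si[0], si[1]))
        return [i for _, i in ranked[:top_k]]

    def inspect_clip(self, index: int) -> str:
        # Finest granularity: look at a single clip in detail.
        c = self.clips[index]
        return f"[{c.start_s:.0f}-{c.end_s:.0f}s] {c.caption}"


@dataclass
class Action:
    tool: str   # name of the tool to call, or "answer" to stop
    args: dict  # parameters formulated for that tool


Policy = Callable[[str, list], Action]


def run_agent(question: str, db: VideoDatabase, policy: Policy, max_steps: int = 5):
    """Iterate: choose a tool from the current observations, call it, record the result."""
    tools = {
        "global_summary": db.global_summary,
        "search_clips": db.search_clips,
        "inspect_clip": db.inspect_clip,
    }
    history = []  # list of (action, observation) pairs seen so far
    for _ in range(max_steps):
        action = policy(question, history)
        if action.tool == "answer":
            return action.args["text"], history
        observation = tools[action.tool](**action.args)
        history.append((action, observation))
    return None, history  # step budget exhausted without an answer


def scripted_policy(question: str, history: list) -> Action:
    """Hand-written placeholder for the LLM planner (an assumption)."""
    if not history:
        return Action("search_clips", {"query": question, "top_k": 1})
    last_action, last_obs = history[-1]
    if last_action.tool == "search_clips":
        if not last_obs:
            return Action("answer", {"text": "not found"})
        return Action("inspect_clip", {"index": last_obs[0]})
    return Action("answer", {"text": last_obs})


db = VideoDatabase([
    Clip(0, 0, 60, "a chef chops onions in the kitchen"),
    Clip(1, 60, 120, "the chef plates the pasta"),
    Clip(2, 120, 180, "guests arrive at the restaurant"),
])
answer, trace = run_agent("who plates the pasta", db, scripted_policy)
print(answer)
print([action.tool for action, _ in trace])
```

In this sketch the workflow is not fixed in `run_agent`: the order and number of tool calls are decided entirely by the policy, which is the property the abstract contrasts with manually designed workflows. Replacing `scripted_policy` with a call to an LLM that returns an `Action` would leave the loop unchanged.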