VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Abstract

Long video understanding (LVU) is a significant challenge for current multi-modal large language models (MLLMs) because of the task's inherent complexity and the limits of their context windows. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning and selectively accesses and uses the essential video content through tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, LVBench, and LongVideoBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state of the art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
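
The control flow described in the abstract (a text-only reasoner that plans, calls a retriever to locate relevant clips, and reads them through a visual perceiver) can be illustrated with a toy sketch. This is not the authors' implementation: `Clip`, `retrieve`, `perceive`, `toy_reasoner`, and `answer_question` are hypothetical names, captions stand in for real video content, and word overlap stands in for a real multimodal retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Clip:
    start: int     # clip start time in seconds
    end: int       # clip end time in seconds
    caption: str   # hypothetical stand-in for the clip's visual content


def retrieve(clips: List[Clip], query: str, k: int = 2) -> List[Clip]:
    """Toy retriever: rank clips by word overlap with the query, keep the top k."""
    words = set(query.lower().split())
    ranked = sorted(
        clips,
        key=lambda c: len(words & set(c.caption.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def perceive(clip: Clip) -> str:
    """Toy visual perceiver: turn a clip into a text observation."""
    return f"[{clip.start}-{clip.end}s] {clip.caption}"


def toy_reasoner(question: str, notes: List[str]) -> Tuple[str, str]:
    """Stand-in for the text-only LRM: it sees only text and picks the next action."""
    if not notes:
        return ("retrieve", question)      # no evidence yet: ask for relevant clips
    return ("answer", " ".join(notes))     # evidence gathered: answer from the notes


def answer_question(
    clips: List[Clip],
    question: str,
    reasoner: Callable[[str, List[str]], Tuple[str, str]] = toy_reasoner,
    max_steps: int = 4,
) -> str:
    """Agent loop: reason, call tools on selected clips only, stop when answered."""
    notes: List[str] = []
    for _ in range(max_steps):
        action, arg = reasoner(question, notes)
        if action == "retrieve":
            notes.extend(perceive(clip) for clip in retrieve(clips, arg))
        elif action == "answer":
            return arg
    return "no answer within step budget"
```

The point of the sketch is that the reasoner never receives the whole video: only the text observations from the retrieved clips enter its context, which is how an approach of this kind sidesteps the context-window limit.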
