VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Abstract

Long-form video understanding is a significant challenge in computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce VideoAgent, a novel agent-based system that employs a large language model as a central agent to iteratively identify and compile the information needed to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy, respectively, while using only 8.4 and 8.2 frames on average. These results demonstrate the superior effectiveness and efficiency of our method over current state-of-the-art methods and highlight the potential of agent-based approaches for advancing long-form video understanding.
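
The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (run_agent, caption, retrieve, decide), the starting frame, the round budget, and the toy stand-ins for the language model and vision-language tools are all assumptions made for illustration. The sketch only shows the control flow, in which an agent inspects the descriptions gathered so far, either answers or asks for more evidence, and a retrieval tool fetches one more frame.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces; none of these names come from the paper.
Caption = Callable[[int], str]                        # frame index -> text description
Retrieve = Callable[[str, List[int]], Optional[int]]  # query, frames seen -> next frame
Decide = Callable[[str, Dict[int, str]], Tuple[Optional[str], str]]  # -> (answer or None, query)


def run_agent(question: str, start_frames: List[int], caption: Caption,
              retrieve: Retrieve, decide: Decide,
              max_rounds: int = 5) -> Tuple[Optional[str], List[int]]:
    """Gather frame descriptions until the agent commits to an answer."""
    notes: Dict[int, str] = {i: caption(i) for i in start_frames}
    answer: Optional[str] = None
    for _ in range(max_rounds):  # max_rounds caps the number of agent decisions
        answer, query = decide(question, notes)
        if answer is not None:
            break  # the agent judged the collected information sufficient
        frame = retrieve(query, list(notes))
        if frame is None:
            break  # no unseen frame is left to inspect
        notes[frame] = caption(frame)
    return answer, sorted(notes)


# Toy stand-ins so the sketch runs without any model (assumptions, not real tools).
VIDEO = [
    "a kitchen",
    "a person picks up a knife",
    "the person chops onions",
    "the person washes hands",
]


def toy_caption(i: int) -> str:
    # Stands in for a vision-language model that describes a frame.
    return VIDEO[i]


def toy_retrieve(query: str, seen: List[int]) -> Optional[int]:
    # Stands in for a retrieval model: pick the unseen frame whose
    # description shares the most words with the query.
    words = set(query.lower().split())
    unseen = [i for i in range(len(VIDEO)) if i not in seen]
    if not unseen:
        return None
    return max(unseen, key=lambda i: len(words & set(VIDEO[i].split())))


def toy_decide(question: str, notes: Dict[int, str]) -> Tuple[Optional[str], str]:
    # Stands in for the language model agent: answer once the evidence
    # mentions chopping, otherwise ask for frames about it.
    for text in notes.values():
        if "chops" in text:
            return "onions", ""
    return None, "person chops"


answer, frames_used = run_agent(
    "What does the person chop?", [0], toy_caption, toy_retrieve, toy_decide
)
print(answer, frames_used)
```

In this toy run the agent starts from a single frame, requests one more, and stops as soon as the gathered descriptions suffice, which mirrors the idea of answering from a small number of frames rather than processing the whole video.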
