VideoAgent: Long-form Video Understanding with Large Language Model as Agent
March 15, 2024
Authors: Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
cs.AI
Abstract
Long-form video understanding represents a significant challenge within
computer vision, demanding a model capable of reasoning over long multi-modal
sequences. Motivated by the human cognitive process for long-form video
understanding, we emphasize interactive reasoning and planning over the ability
to process lengthy visual inputs. We introduce a novel agent-based system,
VideoAgent, that employs a large language model as a central agent to
iteratively identify and compile crucial information to answer a question, with
vision-language foundation models serving as tools to translate and retrieve
visual information. Evaluated on the challenging EgoSchema and NExT-QA
benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only
8.4 and 8.2 frames used on average. These results demonstrate superior
effectiveness and efficiency of our method over the current state-of-the-art
methods, highlighting the potential of agent-based approaches in advancing
long-form video understanding.
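
As a rough illustration of the iterative loop the abstract describes, the Python sketch below shows one way such an agent could be wired together. This is a minimal sketch under assumed interfaces, not the paper's implementation: video_agent_answer, caption_frame, llm_decide, and the decision dictionary with "answer"/"request" keys are hypothetical placeholders standing in for the LLM agent and the vision-language tools.

```python
from typing import Callable, Dict


def video_agent_answer(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],  # hypothetical VLM tool: frame index -> text caption
    llm_decide: Callable[[str], Dict],    # hypothetical LLM agent: prompt -> {"answer": str|None, "request": [int]}
    initial_samples: int = 5,
    max_rounds: int = 4,
) -> str:
    """Sketch of an iterative agent loop: caption a few frames, let the LLM
    either answer or request more frames, and repeat until confident."""
    # Seed the agent's context with a handful of uniformly sampled frames.
    step = (num_frames - 1) / max(initial_samples - 1, 1)
    seed = sorted({round(i * step) for i in range(initial_samples)})
    captions = {idx: caption_frame(idx) for idx in seed}

    prompt = ""
    for _ in range(max_rounds):
        # Compile the evidence gathered so far into a prompt for the LLM agent.
        evidence = "\n".join(f"frame {i}: {captions[i]}" for i in sorted(captions))
        prompt = (
            f"Question: {question}\n"
            f"Frame captions:\n{evidence}\n"
            "Answer if confident; otherwise request additional frame indices."
        )
        decision = llm_decide(prompt)
        if decision.get("answer") is not None:
            return decision["answer"]

        # Retrieve captions for any newly requested frames, then iterate.
        for idx in decision.get("request", []):
            if 0 <= idx < num_frames and idx not in captions:
                captions[idx] = caption_frame(idx)

    # After the final round, force an answer from whatever has been gathered.
    return llm_decide(prompt + "\nYou must answer now.").get("answer", "")
```

The design choice mirrored here is that the language model reasons only over compact textual captions of the frames it chooses to inspect, so cost scales with the handful of frames actually retrieved (8.4 and 8.2 on average in the reported results) rather than with the full length of the video.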