VideoAgent: Long-form Video Understanding with Large Language Model as Agent
March 15, 2024
Authors: Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
cs.AI
Abstract
Long-form video understanding represents a significant challenge within
computer vision, demanding a model capable of reasoning over long multi-modal
sequences. Motivated by the human cognitive process for long-form video
understanding, we emphasize interactive reasoning and planning over the ability
to process lengthy visual inputs. We introduce a novel agent-based system,
VideoAgent, that employs a large language model as a central agent to
iteratively identify and compile crucial information to answer a question, with
vision-language foundation models serving as tools to translate and retrieve
visual information. Evaluated on the challenging EgoSchema and NExT-QA
benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only
8.4 and 8.2 frames used on average. These results demonstrate superior
effectiveness and efficiency of our method over the current state-of-the-art
methods, highlighting the potential of agent-based approaches in advancing
long-form video understanding.
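
As a rough illustration of the iterative loop the abstract describes, the Python sketch below shows one way such an agent could be wired together. This is a minimal sketch under assumed interfaces, not the paper's implementation: video_agent_answer, caption_frame, llm_decide, and the decision dictionary with "answer"/"request" keys are hypothetical placeholders standing in for the LLM agent and the vision-language tools.

```python
from typing import Callable, Dict


def video_agent_answer(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],  # hypothetical VLM tool: frame index -> text caption
    llm_decide: Callable[[str], Dict],    # hypothetical LLM agent: prompt -> {"answer": str|None, "request": [int]}
    initial_samples: int = 5,
    max_rounds: int = 4,
) -> str:
    """Sketch of an iterative agent loop: caption a few frames, let the LLM
    either answer or request more frames, and repeat until confident."""
    # Seed the agent's context with a handful of uniformly sampled frames.
    step = (num_frames - 1) / max(initial_samples - 1, 1)
    seed = sorted({round(i * step) for i in range(initial_samples)})
    captions = {idx: caption_frame(idx) for idx in seed}

    prompt = ""
    for _ in range(max_rounds):
        # Compile the evidence gathered so far into a prompt for the LLM agent.
        evidence = "\n".join(f"frame {i}: {captions[i]}" for i in sorted(captions))
        prompt = (
            f"Question: {question}\n"
            f"Frame captions:\n{evidence}\n"
            "Answer if confident; otherwise request additional frame indices."
        )
        decision = llm_decide(prompt)
        if decision.get("answer") is not None:
            return decision["answer"]

        # Retrieve captions for any newly requested frames, then iterate.
        for idx in decision.get("request", []):
            if 0 <= idx < num_frames and idx not in captions:
                captions[idx] = caption_frame(idx)

    # After the final round, force an answer from whatever has been gathered.
    return llm_decide(prompt + "\nYou must answer now.").get("answer", "")
```

The design choice mirrored here is that the language model reasons only over compact textual captions of the frames it chooses to inspect, so cost scales with the handful of frames actually retrieved (8.4 and 8.2 on average in the reported results) rather than with the full length of the video.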