VideoAgent: Long-form Video Understanding with Large Language Model as Agent
March 15, 2024
Authors: Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
cs.AI
Abstract
Long-form video understanding represents a significant challenge within
computer vision, demanding a model capable of reasoning over long multi-modal
sequences. Motivated by the human cognitive process for long-form video
understanding, we emphasize interactive reasoning and planning over the ability
to process lengthy visual inputs. We introduce a novel agent-based system,
VideoAgent, that employs a large language model as a central agent to
iteratively identify and compile crucial information to answer a question, with
vision-language foundation models serving as tools to translate and retrieve
visual information. Evaluated on the challenging EgoSchema and NExT-QA
benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only
8.4 and 8.2 frames used on average. These results demonstrate superior
effectiveness and efficiency of our method over the current state-of-the-art
methods, highlighting the potential of agent-based approaches in advancing
long-form video understanding.
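
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables (`describe_frame`, `propose_answer`, `retrieve_frames`) are hypothetical stand-ins for the vision-language captioner, the language-model agent, and the frame-retrieval tool, and the uniform initial sampling, confidence flag, and round limit are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def answer_video_question(
    question: str,
    num_frames: int,
    describe_frame: Callable[[int], str],
    propose_answer: Callable[[str, Dict[int, str]], Tuple[str, bool]],
    retrieve_frames: Callable[[str, Dict[int, str]], List[int]],
    initial_samples: int = 5,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return (answer, number of frames inspected).

    All three callables are hypothetical placeholders:
      describe_frame(i)          -> text description of frame i (vision-language tool)
      propose_answer(q, notes)   -> (answer, is_confident) from the language-model agent
      retrieve_frames(q, notes)  -> indices of additional frames worth inspecting
    """
    # Assumption: begin with a small set of uniformly spaced frames.
    step = max(1, num_frames // initial_samples)
    notes: Dict[int, str] = {
        i: describe_frame(i) for i in range(0, num_frames, step)[:initial_samples]
    }

    answer = ""
    for round_index in range(max_rounds):
        # The agent answers from the text notes gathered so far.
        answer, confident = propose_answer(question, notes)
        if confident or round_index == max_rounds - 1:
            break

        # Otherwise, ask the retrieval tool for frames that may fill the gap.
        new_indices = [
            i
            for i in retrieve_frames(question, notes)
            if 0 <= i < num_frames and i not in notes
        ]
        if not new_indices:
            break  # nothing new to look at, so keep the current answer
        for i in new_indices:
            notes[i] = describe_frame(i)

    return answer, len(notes)
```

The point of the sketch is that the number of frames inspected grows only when the agent is not yet confident, which is one way a system of this kind could answer questions while using only a handful of frames on average.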