
VideoAgent: Long-form Video Understanding with Large Language Model as Agent

March 15, 2024
Authors: Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
cs.AI

Abstract

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
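
The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `agent_loop`, the callables `describe`, `judge`, and `retrieve`, the uniform initial sampling, and the confidence-based stopping rule are all assumptions made for illustration. Here `describe` stands in for a vision-language model that translates a frame into text, `judge` for the large language model that answers and decides whether it has enough information, and `retrieve` for a tool that fetches additional relevant frames.

```python
from typing import Callable, Dict, List, Tuple


def agent_loop(
    num_frames: int,
    question: str,
    describe: Callable[[int], str],  # hypothetical: frame index -> text description
    judge: Callable[[str, List[Tuple[int, str]]], Tuple[str, bool]],  # hypothetical LLM call
    retrieve: Callable[[str, List[int]], List[int]],  # hypothetical frame retrieval tool
    initial_samples: int = 5,  # assumed value, not taken from the paper
    max_rounds: int = 3,  # assumed value, not taken from the paper
) -> Tuple[str, List[int]]:
    """Answer a question about a video by iteratively gathering frames."""
    # Assumption: start from a few uniformly spaced frames.
    step = max(1, num_frames // initial_samples)
    seen: List[int] = list(range(0, num_frames, step))[:initial_samples]
    notes: Dict[int, str] = {}
    answer = ""
    for _ in range(max_rounds):
        # Translate any frame not yet described into text.
        for i in seen:
            if i not in notes:
                notes[i] = describe(i)
        # Ask the central agent for an answer and whether it is confident.
        answer, confident = judge(question, sorted(notes.items()))
        if confident:
            break
        # Otherwise retrieve more frames and continue.
        new = [i for i in retrieve(question, seen) if i not in seen]
        if not new:
            break
        seen.extend(new)
    return answer, sorted(seen)


# Toy stand-ins: the question can only be answered once frame 42 is seen.
def toy_describe(i: int) -> str:
    return f"frame {i}"


def toy_judge(question: str, notes: List[Tuple[int, str]]) -> Tuple[str, bool]:
    found = any(i == 42 for i, _ in notes)
    return ("found", True) if found else ("unsure", False)


def toy_retrieve(question: str, seen: List[int]) -> List[int]:
    return [42]


answer, used = agent_loop(100, "What happens?", toy_describe, toy_judge, toy_retrieve)
print(answer, used)
```

In this toy run the loop stops as soon as the judge reports confidence, so only a handful of frames are ever described, which mirrors the efficiency argument in the abstract.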
