VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Abstract

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
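
The iterative loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how frames are first sampled, how the agent decides it has enough information, or when it stops. The callables `describe` (a vision-language model that turns a frame into text), `reason` (the large language model acting as agent), and `retrieve` (a retrieval tool that maps a text query to frame indices), as well as the parameters `initial_k` and `max_rounds`, are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Pick up to k roughly evenly spaced frame indices (assumed initial sampling)."""
    if k <= 1 or num_frames <= 1:
        return [0]
    return sorted({round(i * (num_frames - 1) / (k - 1)) for i in range(k)})


def answer_with_agent(
    num_frames: int,
    question: str,
    describe: Callable[[int], str],
    reason: Callable[[str, Dict[int, str]], Tuple[str, bool, str]],
    retrieve: Callable[[str], List[int]],
    initial_k: int = 5,
    max_rounds: int = 3,
) -> Tuple[str, List[int]]:
    """Iteratively gather frame descriptions until the agent reports confidence.

    All three callables are hypothetical stand-ins:
      describe(idx)           -> text description of frame idx
      reason(question, notes) -> (answer, is_confident, query_for_missing_info)
      retrieve(query)         -> frame indices relevant to the query
    """
    selected = uniform_indices(num_frames, initial_k)
    notes: Dict[int, str] = {}
    answer = ""
    for _ in range(max_rounds):
        # Translate any newly selected frames into text, once per frame.
        for idx in selected:
            if idx not in notes:
                notes[idx] = describe(idx)
        # Ask the agent for an answer and whether it needs more evidence.
        answer, confident, query = reason(question, notes)
        if confident:
            break
        # Retrieve additional frames for the information the agent asked for.
        new_frames = [i for i in retrieve(query) if i not in notes]
        if not new_frames:
            break  # nothing new to look at, so stop early
        selected.extend(new_frames)
    return answer, sorted(notes)
```

The design point this sketch tries to capture is that only the frames the agent asks for are ever described, which is how such a loop can answer a question from a small number of frames instead of processing the whole video.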
