VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
March 18, 2024
作者: Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
cs.AI
Abstract
We explore how reconciling several foundation models (large language models
and vision-language models) with a novel unified memory mechanism could tackle
the challenging video understanding problem, especially capturing the long-term
temporal relations in lengthy videos. In particular, the proposed multimodal
agent VideoAgent: 1) constructs a structured memory to store both the generic
temporal event descriptions and object-centric tracking states of the video; 2)
given an input task query, it employs tools including video segment
localization and object memory querying along with other visual foundation
models to interactively solve the task, utilizing the zero-shot tool-use
ability of LLMs. VideoAgent demonstrates impressive performance on several
long-horizon video understanding benchmarks, with an average increase of 6.6% on
NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between
open-source models and private counterparts including Gemini 1.5 Pro.
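The two-part memory described above (generic temporal event descriptions plus object-centric tracking states, queried through tools such as video segment localization and object memory querying) can be sketched as follows. This is a minimal illustration based only on the abstract; all class names, fields, and tool signatures are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of VideoAgent's unified memory: a temporal log of
# segment captions plus object-centric tracks. An LLM with zero-shot
# tool-use ability would call methods like these to answer a query.
from dataclasses import dataclass, field

@dataclass
class TemporalEvent:
    start_sec: float   # segment start time
    end_sec: float     # segment end time
    caption: str       # generic event description for this segment

@dataclass
class ObjectTrack:
    object_id: int
    category: str                                    # e.g. "person", "cup"
    appearances: list = field(default_factory=list)  # (timestamp, bbox) pairs

@dataclass
class VideoMemory:
    events: list = field(default_factory=list)   # temporal event descriptions
    tracks: dict = field(default_factory=dict)   # object_id -> ObjectTrack

    # Assumed tool 1: locate video segments whose caption mentions a keyword.
    def localize_segments(self, keyword: str):
        return [e for e in self.events if keyword in e.caption]

    # Assumed tool 2: query the object memory by category.
    def query_objects(self, category: str):
        return [t for t in self.tracks.values() if t.category == category]

memory = VideoMemory()
memory.events.append(TemporalEvent(0.0, 10.0, "a person picks up a cup"))
memory.events.append(TemporalEvent(10.0, 20.0, "the person drinks coffee"))
memory.tracks[1] = ObjectTrack(1, "cup", [(0.5, (10, 20, 50, 60))])

hits = memory.localize_segments("cup")   # segments mentioning "cup"
cups = memory.query_objects("cup")       # tracked objects of category "cup"
print(len(hits), len(cups))              # → 1 1
```

In the actual system these tool calls would be issued interactively by the LLM given the task query; this sketch only shows how separating temporal descriptions from object tracks supports both kinds of lookup.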