
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

March 18, 2024
作者: Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
cs.AI

Abstract

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with average gains over baselines of 6.6% on NExT-QA and 26.0% on EgoSchema, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
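To make the abstract's two ingredients concrete — a structured memory holding temporal event descriptions plus object-centric tracking states, and tools (segment localization, object memory querying) the LLM can call — here is a minimal, hypothetical sketch. All names and data layouts are illustrative assumptions, not the paper's actual implementation; the real tools are learned models, whereas the localization here is a naive keyword match stand-in.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a unified memory for a VideoAgent-style agent.
# Names and structures are illustrative, not from the paper's code.

@dataclass
class TemporalMemory:
    # one entry per video segment: (start_sec, end_sec, caption)
    events: list = field(default_factory=list)

@dataclass
class ObjectMemory:
    # object-centric tracking states: object_id -> [(segment_idx, state), ...]
    tracks: dict = field(default_factory=dict)

@dataclass
class UnifiedMemory:
    temporal: TemporalMemory = field(default_factory=TemporalMemory)
    objects: ObjectMemory = field(default_factory=ObjectMemory)

    def segment_localization(self, query: str):
        # Tool 1: find segments whose caption mentions the query.
        # (A naive keyword match standing in for a learned localizer.)
        return [(s, e) for s, e, cap in self.temporal.events if query in cap]

    def object_memory_querying(self, object_id: str):
        # Tool 2: retrieve an object's tracked states across segments.
        return self.objects.tracks.get(object_id, [])

# Populate the memory as a video-parsing stage might, then query it
# the way an LLM would via zero-shot tool calls.
memory = UnifiedMemory()
memory.temporal.events.append((0.0, 5.0, "a person picks up a red cup"))
memory.objects.tracks["cup_1"] = [(0, "on table"), (1, "in hand")]

print(memory.segment_localization("cup"))        # → [(0.0, 5.0)]
print(memory.object_memory_querying("cup_1"))    # → [(0, 'on table'), (1, 'in hand')]
```

In the paper's framing, the LLM never sees raw frames at query time; it composes calls to tools like these over the pre-built memory, which is what lets it reason over long videos.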
