MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

July 31, 2023
Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
cs.AI

Abstract

Recently, integrating video foundation models and large language models to build video understanding systems has overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection remain challenges. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism comprising a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
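
The abstract names the two memory tiers but does not say how tokens move from one to the other. The following is a minimal sketch of one way such a mechanism could look, not the authors' implementation. Everything in it is an assumption made for illustration: the class name `TokenMemory`, the parameters `short_capacity` and `merged_len`, and the consolidation rule (greedily averaging the most similar adjacent pair of tokens by cosine similarity). Tokens are plain Python lists of floats standing in for Transformer token embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_adjacent(tokens, target_len):
    """Shrink a token sequence to target_len entries.

    Assumed rule (not taken from the abstract): repeatedly replace the
    most similar pair of neighbouring tokens with their average.
    """
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_len, 1):
        sims = [cosine(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens


class TokenMemory:
    """Hypothetical two-tier memory: a small, rapidly updated short-term
    buffer that is compacted into a growing long-term store when full."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short_capacity = short_capacity  # tokens held before consolidation
        self.merged_len = merged_len          # tokens kept per consolidation
        self.short_term = []
        self.long_term = []

    def add(self, token):
        """Append one token; consolidate when the short-term buffer is full."""
        self.short_term.append(list(token))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(merge_adjacent(self.short_term, self.merged_len))
            self.short_term = []


memory = TokenMemory(short_capacity=4, merged_len=2)
for tok in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]):
    memory.add(tok)
print(len(memory.short_term), len(memory.long_term))
```

In this sketch the short-term buffer is overwritten every `short_capacity` tokens, while the long-term store grows by only `merged_len` tokens per consolidation, which is the sense in which it is compact and sustained.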