MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
July 31, 2023
Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
cs.AI
Abstract
Recently, integrating video foundation models and large language models to
build a video understanding system has overcome the limitations of specific
pre-defined vision tasks. Yet existing systems can only handle videos with
very few frames. For long videos, computational complexity, memory cost, and
long-term temporal connection remain challenges. Inspired by the
Atkinson-Shiffrin memory model, we develop a memory mechanism that includes a
rapidly updated short-term memory and a compact, and thus sustained, long-term
memory. We employ tokens in Transformers as the carriers of memory. MovieChat
achieves state-of-the-art performance in long video understanding.
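
To make the two-tier memory idea more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that short-term memory is rapidly updated and long-term memory is compact. The fixed-capacity buffer, the rule of averaging the most similar adjacent tokens, and the names `TokenMemory`, `compress`, `short_capacity`, and `compressed_len` are all hypothetical choices made for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two token vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compress(tokens, target_len):
    """Assumed compaction rule: repeatedly average the most similar
    adjacent pair of tokens until only target_len tokens remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_len, 1):
        # Index of the adjacent pair with the highest similarity.
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens


class TokenMemory:
    """Hypothetical two-tier token memory: a small, rapidly updated
    short-term buffer and a compact long-term store."""

    def __init__(self, short_capacity=4, compressed_len=2):
        self.short_capacity = short_capacity
        self.compressed_len = compressed_len
        self.short_term = []
        self.long_term = []

    def add(self, token):
        """Append one frame token; when the short-term buffer is full,
        compress it and move the result into long-term memory."""
        self.short_term.append(list(token))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(
                compress(self.short_term, self.compressed_len))
            self.short_term.clear()


memory = TokenMemory(short_capacity=4, compressed_len=2)
for frame_token in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]):
    memory.add(frame_token)
print(len(memory.short_term), len(memory.long_term))
```

In this sketch, four incoming frame tokens fill the short-term buffer, which is then reduced to two merged tokens in long-term memory, so the stored token count grows more slowly than the number of frames processed.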