MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Abstract

Recently, integrating video foundation models with large language models to build video understanding systems has made it possible to overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection remain challenging. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism comprising a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
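
The abstract names the two memory stores but does not spell out how they are updated. The following is a minimal sketch of one way such a mechanism could work, not the authors' implementation. It assumes that each frame is reduced to a single feature vector, that the short-term memory is a fixed-size buffer, and that a full buffer is compressed into long-term memory by repeatedly averaging the most similar pair of adjacent frames. The names `SparseMemory`, `consolidate`, `short_capacity`, and `merged_len` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(frames, target_len):
    """Greedily average the most similar adjacent pair until target_len frames remain.

    Assumption: averaging adjacent frames is one plausible compression rule;
    the abstract itself does not state which rule is used.
    """
    frames = [list(f) for f in frames]
    while len(frames) > max(target_len, 1):
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # first most-similar pair
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class SparseMemory:
    """Hypothetical two-level memory: a dense short-term buffer and a sparse long-term store."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short_capacity = short_capacity  # frames held before consolidation
        self.merged_len = merged_len          # frames kept after consolidation
        self.short_term = []
        self.long_term = []

    def add_frame(self, token):
        """Append one frame feature; compress the buffer into long-term memory when full."""
        self.short_term.append(list(token))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term.clear()


memory = SparseMemory(short_capacity=4, merged_len=2)
for frame in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    memory.add_frame(frame)
print(memory.long_term)
```

In this toy run the four input frames form two near-duplicate pairs, so the long-term store ends up with two entries while the short-term buffer is emptied. This illustrates how the cost of storing a long video can grow more slowly than its frame count.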
