MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
July 31, 2023
Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
cs.AI
Abstract
Recently, integrating video foundation models and large language models has made it possible to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections remain challenging. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism comprising a rapidly updated short-term memory and a compact yet sustained long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
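The two-tier memory described above can be illustrated with a minimal sketch: a fixed-size short-term buffer of frame tokens that, once full, is consolidated into a compact long-term store by averaging the most similar adjacent tokens. The class name, parameters, and greedy merge rule here are illustrative assumptions, not the paper's exact algorithm.

```python
import math

class TokenMemory:
    """Hypothetical sketch (not the authors' implementation) of a
    two-tier token memory: a rapidly updated short-term buffer plus
    a compact, sustained long-term store."""

    def __init__(self, short_capacity=8, merge_to=2):
        self.short_capacity = short_capacity  # frames held before consolidation
        self.merge_to = merge_to              # tokens kept per consolidation
        self.short_term = []                  # rapidly updated buffer
        self.long_term = []                   # compact, persistent memory

    def add_frame_token(self, token):
        """Append one frame's feature token; consolidate when the buffer fills."""
        self.short_term.append(list(token))
        if len(self.short_term) >= self.short_capacity:
            self._consolidate()

    def _consolidate(self):
        """Greedily average the most similar adjacent pair until only
        `merge_to` tokens remain, then move them to long-term memory."""
        tokens = self.short_term
        while len(tokens) > self.merge_to:
            sims = [self._cosine(tokens[i], tokens[i + 1])
                    for i in range(len(tokens) - 1)]
            i = sims.index(max(sims))
            merged = [(a + b) / 2.0 for a, b in zip(tokens[i], tokens[i + 1])]
            tokens = tokens[:i] + [merged] + tokens[i + 2:]
        self.long_term.extend(tokens)
        self.short_term = []

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)
```

Merging adjacent tokens keeps the long-term store growing only by `merge_to` entries per consolidation, which is one way the token count (and hence attention cost) can stay bounded for arbitrarily long videos.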