MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Abstract

Recent work integrates video foundation models and large language models to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection remain open challenges. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism comprising a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
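To make the two-level memory idea concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state how the short-term memory is updated or how tokens are consolidated, so everything here is an assumption made for illustration: the class name `TokenMemory`, the fixed-capacity buffer, and the rule that merges the most similar adjacent tokens by averaging. Tokens are modeled as plain lists of floats.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(tokens, target_len):
    """Average the most similar adjacent pair until target_len tokens remain.

    This merge rule is an assumption for illustration; the abstract does
    not specify one.
    """
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_len, 1):
        sims = [cosine(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens


class TokenMemory:
    """Hypothetical two-level memory: a small buffer feeding a compact store."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short_term = deque()   # rapidly updated, holds recent tokens
        self.long_term = []         # compact, grows slowly
        self.short_capacity = short_capacity
        self.merged_len = merged_len

    def add(self, token):
        """Buffer a token; when the buffer fills, compress it into long-term memory."""
        self.short_term.append(token)
        if len(self.short_term) == self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term.clear()


memory = TokenMemory(short_capacity=4, merged_len=2)
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]] * 2
for tok in frames:
    memory.add(tok)
```

In this toy run, eight input tokens are compressed into four long-term tokens, so memory grows at half the rate of the input. A real system would tune the buffer size and merge ratio against the available compute budget.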
