MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Abstract

Recent work integrates video foundation models with large language models to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computation complexity, memory cost, and long-term temporal connection remain open challenges. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism that includes a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
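
The two-level memory described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how short-term tokens are condensed into long-term memory, so the rule used here (averaging the most similar adjacent frame tokens) is an assumption, and the names SparseMemory, consolidate, short_capacity, and keep are hypothetical.

```python
from collections import deque
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(frames, keep):
    """Assumed rule: average the most similar adjacent pair until `keep` frames remain."""
    frames = [list(f) for f in frames]
    while len(frames) > keep:
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class SparseMemory:
    """Hypothetical two-level memory: a small dense buffer feeding a sparse store."""

    def __init__(self, short_capacity=4, keep=2):
        self.short_capacity = short_capacity  # frames held before consolidation
        self.keep = keep                      # frames kept per consolidation
        self.short_term = deque()             # rapidly updated, dense
        self.long_term = []                   # compact, persistent

    def add(self, frame_token):
        """Append one frame token; condense the buffer once it is full."""
        self.short_term.append(frame_token)
        if len(self.short_term) == self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.keep))
            self.short_term.clear()


memory = SparseMemory(short_capacity=4, keep=2)
for token in ([1, 0], [1, 0], [0, 1], [0, 1], [1, 1]):
    memory.add(token)
```

In this sketch, four incoming frame tokens are condensed into two long-term entries, and the fifth token waits in the short-term buffer. Long-term memory therefore grows more slowly than the number of frames seen, which is the property the abstract refers to as compact.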
