MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, which restricts them to short video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
