
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

April 8, 2024
作者: Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
cs.AI

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
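The abstract sketches the core idea: rather than feeding many frames to the LLM at once, the video is processed frame by frame and past visual features are kept in a memory bank that later steps can consult. Below is a minimal illustrative sketch of that idea, not the authors' implementation; the class and method names (MemoryBank, add, read) and the similarity-based merging used to cap the bank's size are assumptions made for this example.

```python
# Illustrative sketch of an online memory bank for long-video processing.
# Names and the compression strategy are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


class MemoryBank:
    def __init__(self, max_size: int):
        self.max_size = max_size              # cap on stored feature vectors
        self.features: list[torch.Tensor] = []

    def add(self, frame_feature: torch.Tensor) -> None:
        """Append one frame's feature; compress when the bank is full."""
        self.features.append(frame_feature)
        if len(self.features) > self.max_size:
            self._compress()

    def _compress(self) -> None:
        """Merge the two most similar adjacent features -- one plausible way
        to keep the bank a fixed size without discarding all old history."""
        feats = torch.stack(self.features)                      # (T, D)
        sims = F.cosine_similarity(feats[:-1], feats[1:], dim=-1)
        i = int(torch.argmax(sims))                             # most redundant pair
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i:i + 2] = [merged]

    def read(self) -> torch.Tensor:
        """Return stored history, e.g. as context for the current step."""
        return torch.stack(self.features)                       # (<=max_size, D)


# Usage: process a long video frame by frame (online), consulting the bank
# at each step instead of passing every frame to the LLM simultaneously.
bank = MemoryBank(max_size=10)
for t in range(100):                      # stand-in for a frame stream
    feat = torch.randn(256)               # stand-in for a visual encoder output
    bank.add(feat)
    history = bank.read()                 # bounded-size context for step t
```

Capping the bank at a fixed size is what keeps the context fed to the LLM, and hence GPU memory, bounded regardless of video length.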
