
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

April 8, 2024
作者: Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
cs.AI

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
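The abstract sketches the core idea: rather than feeding many frames to the LLM at once, the video is processed frame by frame and past visual features are kept in a memory bank that later steps can consult. Below is a minimal illustrative sketch of that idea, not the authors' implementation; the class and method names (MemoryBank, add, read) and the similarity-based merging used to cap the bank's size are assumptions made for this example.

```python
# Illustrative sketch of an online memory bank for long-video processing.
# Names and the compression strategy are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


class MemoryBank:
    def __init__(self, max_size: int):
        self.max_size = max_size              # cap on stored feature vectors
        self.features: list[torch.Tensor] = []

    def add(self, frame_feature: torch.Tensor) -> None:
        """Append one frame's feature; compress when the bank is full."""
        self.features.append(frame_feature)
        if len(self.features) > self.max_size:
            self._compress()

    def _compress(self) -> None:
        """Merge the two most similar adjacent features -- one plausible way
        to keep the bank a fixed size without discarding all old history."""
        feats = torch.stack(self.features)                      # (T, D)
        sims = F.cosine_similarity(feats[:-1], feats[1:], dim=-1)
        i = int(torch.argmax(sims))                             # most redundant pair
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i:i + 2] = [merged]

    def read(self) -> torch.Tensor:
        """Return stored history, e.g. as context for the current step."""
        return torch.stack(self.features)                       # (<=max_size, D)


# Usage: process a long video frame by frame (online), consulting the bank
# at each step instead of passing every frame to the LLM simultaneously.
bank = MemoryBank(max_size=10)
for t in range(100):                      # stand-in for a frame stream
    feat = torch.randn(256)               # stand-in for a visual encoder output
    bank.add(feat)
    history = bank.read()                 # bounded-size context for step t
```

Capping the bank at a fixed size is what keeps the context fed to the LLM, and hence GPU memory, bounded regardless of video length.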
