MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

April 8, 2024
Authors: Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
cs.AI

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has recently gained much more interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
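The online, memory-bank idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the bank is kept within a fixed size, so the code below assumes one plausible policy (averaging the two temporally adjacent entries that are most similar). The names `MemoryBank`, `cosine`, and `process_video`, and the use of plain Python lists as stand-ins for frame features, are all hypothetical.

```python
import math
from typing import Iterable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Fixed-capacity store of per-frame features (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append one frame feature, then compress if over capacity."""
        self.slots.append(list(feature))
        if len(self.slots) > self.capacity:
            self._merge_most_similar_neighbours()

    def _merge_most_similar_neighbours(self) -> None:
        # Assumed policy (not stated in the abstract): average the adjacent
        # pair with the highest cosine similarity, keeping temporal order.
        sims = [
            cosine(self.slots[i], self.slots[i + 1])
            for i in range(len(self.slots) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.slots[i], self.slots[i + 1])]
        self.slots[i:i + 2] = [merged]


def process_video(frames: Iterable[Vector], capacity: int) -> MemoryBank:
    """Consume frames one at a time; memory use is bounded by `capacity`."""
    bank = MemoryBank(capacity)
    for feature in frames:
        bank.add(feature)
    return bank
```

Under these assumptions, the number of stored entries never exceeds `capacity` regardless of video length, which is the property the abstract relies on to stay within context-length and GPU memory limits. A downstream model would then attend to `bank.slots` instead of the full frame sequence.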
