MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
April 4, 2024
Authors: Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny
cs.AI
Abstract
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM)
designed specifically for video understanding. The model is capable of
processing both temporal visual and textual data, making it adept at
understanding the complexities of videos. Building upon the success of
MiniGPT-v2, which excelled in translating visual features into the LLM space
for single images and achieved impressive results on various image-text
benchmarks, this paper extends the model's capabilities to process a sequence
of frames, enabling it to comprehend videos. MiniGPT4-Video not only
considers visual content but also incorporates textual conversations, allowing
the model to effectively answer queries involving both visual and text
components. The proposed model outperforms existing state-of-the-art methods,
registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF,
and TVQA benchmarks respectively. Our models and code have been made publicly
available at https://vision-cair.github.io/MiniGPT4-video/
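The abstract describes extending a single-image model to a sequence of frames by interleaving each frame's visual tokens with accompanying subtitle text before passing the sequence to the LLM. The paper's implementation details are not given here, so the following is a minimal, hypothetical sketch of that interleaving step under stated assumptions; encode_frame, tokenize, NUM_VISUAL_TOKENS_PER_FRAME, and the <frame> markers are placeholders, not the authors' API.

```python
# Hypothetical sketch (not the authors' code) of interleaving per-frame visual
# tokens with subtitle text tokens into a single sequence for the LLM.

from typing import Any, List

NUM_VISUAL_TOKENS_PER_FRAME = 64  # assumed per-frame token budget


def encode_frame(frame: Any) -> List[str]:
    """Stand-in for a vision encoder plus projection into the LLM token space."""
    return [f"<vis_{i}>" for i in range(NUM_VISUAL_TOKENS_PER_FRAME)]


def tokenize(text: str) -> List[str]:
    """Stand-in for the LLM tokenizer."""
    return text.split()


def build_interleaved_sequence(frames: List[Any], subtitles: List[str]) -> List[str]:
    """Interleave the visual tokens of each sampled frame with its subtitle text."""
    sequence: List[str] = []
    for frame, subtitle in zip(frames, subtitles):
        sequence += ["<frame>"] + encode_frame(frame) + ["</frame>"]
        sequence += tokenize(subtitle)
    return sequence


if __name__ == "__main__":
    frames = [object(), object()]            # placeholder frame data
    subtitles = ["a dog runs", "it jumps"]   # placeholder per-frame subtitles
    print(len(build_interleaved_sequence(frames, subtitles)))
```

In this sketch the interleaving simply alternates frame-level visual tokens and subtitle tokens in temporal order; how many frames are sampled and how the subtitle text is aligned per frame are assumptions, not details taken from the abstract.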