MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Abstract

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model processes both temporal visual data and textual data, which makes it well suited to the complexities of video. It builds on MiniGPT-v2, which excelled at translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, and extends that model to process a sequence of frames so that it can comprehend videos. MiniGPT4-Video considers not only visual content but also textual conversations, allowing it to answer queries that involve both visual and textual components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. The models and code are publicly available at https://vision-cair.github.io/MiniGPT4-video/
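
The title refers to interleaved visual-textual tokens, and the abstract describes processing a sequence of frames together with text. The following is a minimal sketch of what such an interleaving could look like, not the authors' implementation. The function name interleave_tokens, the placeholder token strings, and the one-text-segment-per-frame pairing are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence


def interleave_tokens(
    frame_tokens: Sequence[Sequence[str]],
    text_tokens: Sequence[Sequence[str]],
) -> List[str]:
    """Interleave per-frame visual tokens with the text tokens paired to each frame.

    Hypothetical helper: assumes exactly one (possibly empty) text segment per frame.
    """
    if len(frame_tokens) != len(text_tokens):
        raise ValueError("expected one text segment per frame")

    sequence: List[str] = []
    for visual, text in zip(frame_tokens, text_tokens):
        sequence.extend(visual)  # visual tokens for this frame come first
        sequence.extend(text)    # then the text tokens aligned with this frame
    return sequence


# Placeholder tokens standing in for projected visual features and text tokens.
frames = [["<img_0_a>", "<img_0_b>"], ["<img_1_a>", "<img_1_b>"]]
texts = [["hello"], ["world"]]
print(interleave_tokens(frames, texts))
```

In this sketch the resulting flat sequence alternates between frames and their text, so a language model reading it left to right sees each frame next to the text that accompanies it.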
