VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

June 11, 2024
作者: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
cs.AI

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching its multimodal understanding capabilities by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are publicly available to facilitate further research.
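
As a concrete illustration of the connector idea, the sketch below shows one plausible way an STC-style module could be built in PyTorch: per-frame vision-encoder tokens are downsampled jointly across frames and patches by a strided 3D convolution, then projected into the LLM embedding space. The class name, feature dimensions, and downsampling factors are assumptions made for this sketch, not the released implementation; consult the official repository for the actual code.

```python
# Minimal, hypothetical sketch of a Spatial-Temporal Convolution (STC)
# connector. All names and sizes are illustrative assumptions; the real
# VideoLLaMA 2 connector may differ in depth, kernels, and block design.
import torch
import torch.nn as nn

class STCConnector(nn.Module):
    """Downsample vision tokens jointly in space and time, then project
    them into the language model's embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096, downsample=(2, 2, 2)):
        super().__init__()
        # Strided 3D convolution mixes neighboring frames and patches while
        # cutting the token count by prod(downsample) (8x here).
        self.conv3d = nn.Conv3d(
            vision_dim, vision_dim,
            kernel_size=downsample, stride=downsample,
        )
        # Two-layer MLP projector into the LLM hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, frames, height, width, vision_dim) patch features
        b, t, h, w, d = x.shape
        x = x.permute(0, 4, 1, 2, 3)      # -> (b, d, t, h, w) for Conv3d
        x = self.conv3d(x)                # spatial-temporal downsampling
        x = x.flatten(2).transpose(1, 2)  # -> (b, tokens, d)
        return self.proj(x)               # -> (b, tokens, llm_dim)

# Example: 16 frames of 24x24 patch features become 16*24*24/8 = 1152 tokens.
tokens = STCConnector()(torch.randn(1, 16, 24, 24, 1024))
print(tokens.shape)  # torch.Size([1, 1152, 4096])
```

The Audio Branch can be pictured analogously: features from a pretrained audio encoder are projected into the same token space and fed to the LLM alongside the video tokens, which is what allows joint training to incorporate audio cues.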
