
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

June 11, 2024
Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
cs.AI

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, enriching the model's multimodal understanding by incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question-answering (AQA and OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

