
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

June 22, 2024
作者: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
cs.AI

Abstract

Speech understanding, as an element of the more generic video understanding using audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than a 25% absolute accuracy improvement on the video-QA task and over a 30% absolute accuracy improvement on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that other av-LLMs cannot handle. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
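The abstract does not describe the MRC Q-Former in detail, but its core idea, querying the encoded frame sequence in causal windows at several temporal resolutions and concatenating the pooled tokens, can be sketched as below. This is a minimal illustrative assumption in NumPy, not the paper's actual architecture: the random "queries", the window sizes, and the simple attention pooling are all placeholders for learned components.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mrc_qformer_sketch(feats, window_sizes=(1, 4, 16), n_queries=2, seed=0):
    """Toy multi-resolution causal Q-Former (illustrative only).

    For each resolution, slide a non-overlapping window over the frame
    features so that each block only sees its own (past) frames, and let a
    small set of query vectors attention-pool the window into n_queries
    tokens. Tokens from all resolutions are concatenated, so fine windows
    preserve the temporal detail speech needs while coarse windows stay
    cheap for slowly varying visual content.
    """
    rng = np.random.default_rng(seed)
    T, D = feats.shape
    outputs = []
    for w in window_sizes:
        queries = rng.standard_normal((n_queries, D))  # stand-in for learned queries
        for start in range(0, T, w):
            window = feats[start:start + w]            # frames visible to this block
            attn = softmax(queries @ window.T / np.sqrt(D), axis=-1)
            outputs.append(attn @ window)              # n_queries pooled tokens
    return np.concatenate(outputs, axis=0)

feats = np.random.default_rng(1).standard_normal((16, 8))  # 16 frames, dim 8
tokens = mrc_qformer_sketch(feats)
# 16/1 + 16/4 + 16/16 = 21 windows, 2 tokens each -> 42 output tokens
print(tokens.shape)  # (42, 8)
```

Note how the token budget scales: halving a window size doubles the number of tokens that resolution emits, which is why mixing resolutions is cheaper than running every frame at the finest granularity.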
