video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
June 22, 2024
Authors: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
cs.AI
Abstract
Speech understanding, as an element of the more generic video understanding performed by audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events, and music, but speech as well. To obtain the fine-grained temporal information required for speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvement on the video-QA task and over 30% absolute accuracy improvement on audio-visual QA tasks involving human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that other av-LLMs cannot handle. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
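To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of a multi-resolution causal Q-Former: learnable queries cross-attend to pre-trained audio-visual encoder features within causal temporal windows at several resolutions, and the concatenated query tokens are projected into the LLM's embedding space. All class names, window sizes, and dimensions here are hypothetical illustrations based only on the abstract, not the authors' implementation.

```python
# Minimal sketch of a multi-resolution causal Q-Former (MRC Q-Former).
# Hypothetical names and hyperparameters; NOT the paper's implementation.
import torch
import torch.nn as nn


class CausalQFormerBlock(nn.Module):
    """Learnable queries cross-attend to encoder features, one temporal
    window at a time; the queries for a window only see features up to
    the end of that window, which makes the attention causal in time."""

    def __init__(self, d_model=768, n_heads=8, n_queries=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, feats, window):
        # feats: (B, T, d) fused audio-visual features from frozen encoders
        B, T, _ = feats.shape
        outs = []
        for start in range(0, T, window):
            ctx = feats[:, : start + window]  # causal context: past + current window
            q = self.queries.unsqueeze(0).expand(B, -1, -1)
            attn_out, _ = self.cross_attn(q, ctx, ctx)
            outs.append(attn_out + self.ffn(self.norm(attn_out)))
        return torch.cat(outs, dim=1)  # (B, n_windows * n_queries, d)


class MRCQFormer(nn.Module):
    """Run causal Q-Former blocks at several temporal resolutions (here,
    window sizes in frames) and project the concatenated query tokens
    into the backbone LLM's embedding space."""

    def __init__(self, d_model=768, llm_dim=4096, windows=(1, 5, 25)):
        super().__init__()
        self.windows = windows
        self.blocks = nn.ModuleList(CausalQFormerBlock(d_model) for _ in windows)
        self.proj = nn.Linear(d_model, llm_dim)

    def forward(self, feats):
        tokens = [blk(feats, w) for blk, w in zip(self.blocks, self.windows)]
        return self.proj(torch.cat(tokens, dim=1))  # LLM input tokens


# Example: 2 clips, 50 frames of 768-d fused features -> LLM input tokens
llm_inputs = MRCQFormer()(torch.randn(2, 50, 768))
```

The small windows preserve the fine-grained temporal detail that speech needs, while the large windows summarize slower-changing visual and audio-event content with far fewer tokens.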
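The abstract also names a diversity loss intended to keep individual frames or a single modality from dominating. One plausible reading, sketched below under the same caveat (a guess at the objective from the abstract's wording, not the paper's exact formula), penalizes pairwise cosine similarity among the output query tokens so that different queries capture different content:

```python
# One plausible form of a diversity loss: discourage output query tokens
# from collapsing onto the same content. A guess from the abstract's
# description, not the paper's exact objective.
import torch
import torch.nn.functional as F


def diversity_loss(query_tokens):
    """query_tokens: (B, N, d) Q-Former outputs. Returns the mean absolute
    off-diagonal cosine similarity; minimizing it pushes queries apart."""
    q = F.normalize(query_tokens, dim=-1)
    sim = q @ q.transpose(1, 2)                       # (B, N, N) cosine similarities
    eye = torch.eye(sim.size(-1), device=sim.device)  # mask out self-similarity
    return (sim * (1 - eye)).abs().mean()
```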