video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
June 22, 2024
Authors: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
cs.AI
Abstract
Speech understanding, as an element of the more generic video understanding performed by audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events, and music, but speech as well. To obtain the fine-grained temporal information required for speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvement on the video-QA task and over 30% absolute accuracy improvement on audio-visual QA tasks involving human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that other av-LLMs cannot handle. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
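To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of a multi-resolution causal Q-Former: learnable queries cross-attend to pre-trained audio-visual encoder features within causal temporal windows at several resolutions, and the concatenated query tokens are projected into the LLM's embedding space. All class names, window sizes, and dimensions here are hypothetical illustrations based only on the abstract, not the authors' implementation.

```python
# Minimal sketch of a multi-resolution causal Q-Former (MRC Q-Former).
# Hypothetical names and hyperparameters; NOT the paper's implementation.
import torch
import torch.nn as nn


class CausalQFormerBlock(nn.Module):
    """Learnable queries cross-attend to encoder features, one temporal
    window at a time; the queries for a window only see features up to
    the end of that window, which makes the attention causal in time."""

    def __init__(self, d_model=768, n_heads=8, n_queries=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, feats, window):
        # feats: (B, T, d) fused audio-visual features from frozen encoders
        B, T, _ = feats.shape
        outs = []
        for start in range(0, T, window):
            ctx = feats[:, : start + window]  # causal context: past + current window
            q = self.queries.unsqueeze(0).expand(B, -1, -1)
            attn_out, _ = self.cross_attn(q, ctx, ctx)
            outs.append(attn_out + self.ffn(self.norm(attn_out)))
        return torch.cat(outs, dim=1)  # (B, n_windows * n_queries, d)


class MRCQFormer(nn.Module):
    """Run causal Q-Former blocks at several temporal resolutions (here,
    window sizes in frames) and project the concatenated query tokens
    into the backbone LLM's embedding space."""

    def __init__(self, d_model=768, llm_dim=4096, windows=(1, 5, 25)):
        super().__init__()
        self.windows = windows
        self.blocks = nn.ModuleList(CausalQFormerBlock(d_model) for _ in windows)
        self.proj = nn.Linear(d_model, llm_dim)

    def forward(self, feats):
        tokens = [blk(feats, w) for blk, w in zip(self.blocks, self.windows)]
        return self.proj(torch.cat(tokens, dim=1))  # LLM input tokens


# Example: 2 clips, 50 frames of 768-d fused features -> LLM input tokens
llm_inputs = MRCQFormer()(torch.randn(2, 50, 768))
```

The small windows preserve the fine-grained temporal detail that speech needs, while the large windows summarize slower-changing visual and audio-event content with far fewer tokens.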
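The abstract also names a diversity loss intended to keep individual frames or a single modality from dominating. One plausible reading, sketched below under the same caveat (a guess at the objective from the abstract's wording, not the paper's exact formula), penalizes pairwise cosine similarity among the output query tokens so that different queries capture different content:

```python
# One plausible form of a diversity loss: discourage output query tokens
# from collapsing onto the same content. A guess from the abstract's
# description, not the paper's exact objective.
import torch
import torch.nn.functional as F


def diversity_loss(query_tokens):
    """query_tokens: (B, N, d) Q-Former outputs. Returns the mean absolute
    off-diagonal cosine similarity; minimizing it pushes queries apart."""
    q = F.normalize(query_tokens, dim=-1)
    sim = q @ q.transpose(1, 2)                       # (B, N, N) cosine similarities
    eye = torch.eye(sim.size(-1), device=sim.device)  # mask out self-similarity
    return (sim * (1 - eye)).abs().mean()
```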