
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

June 22, 2024
作者: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
cs.AI

Abstract

Speech understanding, as an element of the more general video understanding performed by audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing that can understand not only visual frame sequences, audio events, and music, but speech as well. To obtain the fine-grained temporal information required for speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders to the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to prevent individual frames or modalities from dominating. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvement on the video-QA task and more than 30% absolute accuracy improvement on audio-visual QA tasks involving human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that other av-LLMs cannot handle. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
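The abstract names two technical ingredients: a multi-resolution causal Q-Former that turns encoder features into LLM input tokens at several temporal granularities, and a diversity loss that keeps individual frames or queries from dominating. The sketch below illustrates one plausible reading of those ideas in PyTorch; the window sizes, dimensions, module structure, and exact form of the diversity loss are all assumptions for illustration, not the paper's implementation (see the released code at https://github.com/bytedance/SALMONN/ for the real model).

```python
# Illustrative sketch only: a causal Q-Former applied over non-overlapping
# temporal windows at several resolutions, plus a diversity loss that
# penalizes output queries for collapsing onto the same content.
# All names, window sizes, and dimensions here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalQFormerBlock(nn.Module):
    """Learned queries cross-attend to one causal window of encoder features."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, window_len, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, feats, feats)
        q = self.norm1(q + attn_out)
        return self.norm2(q + self.ffn(q))


class MultiResolutionCausalQFormer(nn.Module):
    """Runs a Q-Former over causal (no-lookahead) windows at several temporal
    resolutions and concatenates the query tokens along the time axis."""

    def __init__(self, dim: int, window_sizes=(1, 5, 25), queries_per_window=4):
        super().__init__()
        self.window_sizes = window_sizes
        self.blocks = nn.ModuleList(
            CausalQFormerBlock(dim, queries_per_window) for _ in window_sizes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) fused audio-visual encoder features.
        outputs = []
        for win, block in zip(self.window_sizes, self.blocks):
            for start in range(0, feats.size(1), win):
                chunk = feats[:, start:start + win]   # window sees no future frames
                outputs.append(block(chunk))
        return torch.cat(outputs, dim=1)              # token sequence for the LLM


def diversity_loss(query_tokens: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity between output query tokens so that
    no single frame or query dominates the representation."""
    q = F.normalize(query_tokens, dim=-1)             # (batch, n, dim)
    sim = q @ q.transpose(1, 2)                       # (batch, n, n)
    off_diag = sim - torch.eye(sim.size(-1), device=sim.device)
    return off_diag.abs().mean()


if __name__ == "__main__":
    feats = torch.randn(2, 50, 256)   # 2 clips, 50 time steps, feature dim 256
    mrc = MultiResolutionCausalQFormer(dim=256)
    tokens = mrc(feats)
    print(tokens.shape, diversity_loss(tokens).item())
```

Processing each window independently keeps the mapping causal and lets the finest resolution preserve the frame-level timing that speech requires, while coarser windows summarize slower-changing visual and audio content with far fewer tokens.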
