video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
June 22, 2024
作者: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
cs.AI
Abstract
Speech understanding, as an element of the more general video understanding performed by audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing that can understand not only visual frame sequences, audio events and music, but also speech. To obtain the fine-grained temporal information required for speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect the pre-trained audio-visual encoders to the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to prevent any single frame or modality from dominating. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25\% absolute accuracy improvement on the video-QA task and over 30\% absolute accuracy improvement on audio-visual QA tasks involving human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are beyond the reach of other av-LLMs. Our training code and model checkpoints are available at \url{https://github.com/bytedance/SALMONN/}.
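
To make the MRC Q-Former idea concrete, below is a minimal PyTorch sketch: learned queries cross-attend to frame-synchronised audio-visual features once per temporal window, each window may only attend to features up to its own end point (the causal constraint), and several such levels run at different window sizes (fine for speech, coarse for events and scenes) before projection into the LLM input space. All module names, dimensions, window sizes, and the exact windowing scheme are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a multi-resolution causal Q-Former, assuming a
# simple window-based causal scheme. Names/shapes are hypothetical.
import torch
import torch.nn as nn


class CausalQFormerLevel(nn.Module):
    """One resolution level: a fixed set of learned queries cross-attends
    to the encoder features once per temporal window; each window may only
    see features up to its own end point (causality)."""

    def __init__(self, d_model: int, n_queries: int, window: int, n_heads: int = 8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_out = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model) frame-synchronised A/V features.
        b, t, _ = feats.shape
        outputs = []
        # Trailing frames that do not fill a whole window are dropped,
        # purely to keep the sketch short.
        for end in range(self.window, t + 1, self.window):
            ctx = feats[:, :end]  # causal: only features up to `end`
            q = self.queries.unsqueeze(0).expand(b, -1, -1)
            out, _ = self.attn(self.norm_q(q), ctx, ctx)
            outputs.append(out + self.ffn(self.norm_out(out)))
        return torch.cat(outputs, dim=1)  # (batch, n_windows * n_queries, d_model)


class MRCQFormer(nn.Module):
    """Several causal Q-Former levels at different temporal resolutions
    (fine windows for speech, coarse windows for audio events and visual
    scenes), concatenated and projected into the LLM embedding space."""

    def __init__(self, d_model: int = 768, d_llm: int = 4096,
                 n_queries: int = 4, windows: tuple = (2, 8, 32)):
        super().__init__()
        self.levels = nn.ModuleList(
            CausalQFormerLevel(d_model, n_queries, w) for w in windows
        )
        self.proj = nn.Linear(d_model, d_llm)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([level(feats) for level in self.levels], dim=1)
        return self.proj(tokens)  # prepended to the LLM's text embeddings


if __name__ == "__main__":
    feats = torch.randn(2, 64, 768)   # 2 clips, 64 synchronised frames
    print(MRCQFormer()(feats).shape)  # torch.Size([2, 168, 4096])
```

The fine-resolution level emits many tokens per second (enough temporal detail for speech), while the coarse levels summarise longer spans cheaply, which is the efficiency trade-off the abstract describes.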
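The diversity loss is only named in the abstract; one plausible reading is a penalty on pairwise similarity between Q-Former output tokens, so that different queries encode different frames and modalities rather than collapsing onto a dominant one. The sketch below implements that reading; the function name, the cosine-similarity formulation, and the weighting against the language-modelling loss are all assumptions.

```python
# A hedged sketch of one possible diversity loss: penalise positive
# pairwise cosine similarity between distinct Q-Former output tokens.
import torch
import torch.nn.functional as F


def diversity_loss(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, n_tokens, dim) Q-Former outputs. Encourages
    distinct tokens to encode different content."""
    t = F.normalize(tokens, dim=-1)
    sim = t @ t.transpose(1, 2)                        # (batch, n, n) cosine sims
    n = sim.size(-1)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    off_diag = sim.masked_fill(mask, 0.0).clamp(min=0)  # ignore self-similarity
    return off_diag.sum(dim=(1, 2)).mean() / (n * (n - 1))


# Hypothetical use during training (the 0.1 weight is an assumption):
# total_loss = lm_loss + 0.1 * diversity_loss(qformer_tokens)
```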