WavLLM: Towards Robust and Adaptive Speech Large Language Model
March 31, 2024
作者: Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
cs.AI
Abstract
The recent advancements in large language models (LLMs) have revolutionized
the field of natural language processing, progressively broadening their scope
to multimodal perception and generation. However, effectively integrating
listening capabilities into LLMs poses significant challenges, particularly
with respect to generalizing across varied contexts and executing complex
auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech
large language model with dual encoders and a prompt-aware LoRA weight
adapter, optimized by a two-stage curriculum learning approach. Leveraging dual
encoders, we decouple different types of speech information, utilizing a
Whisper encoder to process the semantic content of speech, and a WavLM encoder
to capture the unique characteristics of the speaker's identity. Within the
curriculum learning framework, WavLLM first builds its foundational
capabilities by optimizing on a mixture of elementary single tasks, followed by
advanced multi-task training on more complex tasks such as combinations of the
elementary tasks. To enhance the model's flexibility and its adherence to
different tasks and instructions, a prompt-aware LoRA weight adapter is
introduced in the
second advanced multi-task training stage. We validate the proposed model on
universal speech benchmarks including tasks such as ASR (automatic speech
recognition), ST (speech translation), SV (speaker verification), and ER
(emotion recognition), and also apply it to specialized datasets such as the
Gaokao English listening comprehension set for SQA (spoken question
answering) and a speech Chain-of-Thought (CoT) evaluation set. Experiments
demonstrate that the proposed model achieves state-of-the-art performance
across a range of speech tasks at the same model size, exhibiting robust
generalization capabilities in executing complex tasks using a CoT approach.
Furthermore, our model successfully completes Gaokao tasks without specialized
training. The code, models, audio, and Gaokao evaluation set can be accessed
at aka.ms/wavllm.
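To make the dual-encoder design concrete, below is a minimal PyTorch sketch of how the two feature streams could be fused before entering the LLM. It is an illustrative assumption, not the authors' implementation: the module name DualEncoderFrontend, the feature dimensions, and the concatenate-then-project fusion are all hypothetical, and the Whisper and WavLM encoders are assumed to produce frame-aligned features.

```python
# Minimal sketch of the dual-encoder front-end described in the abstract.
# NOT the authors' implementation: names, dimensions, and the
# concatenate-then-project fusion are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncoderFrontend(nn.Module):
    def __init__(self, whisper_dim=1280, wavlm_dim=1024, llm_dim=4096):
        super().__init__()
        self.semantic_proj = nn.Linear(whisper_dim, llm_dim)  # Whisper stream: semantic content
        self.speaker_proj = nn.Linear(wavlm_dim, llm_dim)     # WavLM stream: speaker identity
        self.fusion = nn.Linear(2 * llm_dim, llm_dim)         # merge the two streams

    def forward(self, whisper_feats, wavlm_feats):
        # whisper_feats: (batch, frames, whisper_dim) from a frozen Whisper encoder
        # wavlm_feats:   (batch, frames, wavlm_dim) from a frozen WavLM encoder,
        # assumed here to be aligned to the same frame rate
        sem = self.semantic_proj(whisper_feats)
        spk = self.speaker_proj(wavlm_feats)
        fused = self.fusion(torch.cat([sem, spk], dim=-1))
        return fused  # speech embeddings consumed by the LLM with the text prompt
```

Decoupling the streams lets the semantic path serve ASR/ST-style tasks while the speaker path supports SV, without forcing a single encoder to carry both kinds of information.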
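The prompt-aware LoRA weight adapter introduced in the second training stage can be sketched in the same spirit: a small gating network reads a pooled embedding of the instruction and scales the low-rank update, so different prompts engage the adapter to different degrees. This is a hedged sketch under assumptions; the gate architecture, the pooled prompt embedding, and where the scale is applied are chosen here only for illustration.

```python
# Hedged sketch of a prompt-aware LoRA linear layer: the low-rank update
# is gated by a scale predicted from the prompt, so different instructions
# engage the adapter to different degrees. The gate design is an assumption.
import torch
import torch.nn as nn

class PromptAwareLoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, prompt_dim=4096):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)  # frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # zero-init: training starts from the base model
        # Tiny hypothetical gate: pooled prompt embedding -> per-example scale in (0, 1).
        self.gate = nn.Sequential(nn.Linear(prompt_dim, 1), nn.Sigmoid())

    def forward(self, x, prompt_emb):
        # x:          (batch, seq, in_dim) hidden states
        # prompt_emb: (batch, prompt_dim) pooled embedding of the instruction
        scale = self.gate(prompt_emb).unsqueeze(1)  # (batch, 1, 1) for broadcasting
        return self.base(x) + scale * self.lora_b(self.lora_a(x))
```

In this sketch the base weight stays frozen and lora_b starts at zero, so training begins from the unmodified pretrained model; the gate then learns how strongly each instruction should activate the adapted weights.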