WavLLM: Towards Robust and Adaptive Speech Large Language Model

Abstract

Recent advances in large language models (LLMs) have revolutionized natural language processing and progressively broadened their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs remains a significant challenge, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, using a Whisper encoder to process the semantic content of speech and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks, such as combinations of the elementary tasks. To enhance flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second, advanced multi-task training stage. We validate the proposed model on universal speech benchmarks covering tasks such as ASR, ST, SV, and ER, and also apply it to specialized datasets such as the Gaokao English listening comprehension set for SQA and a speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks among models of the same size and generalizes robustly when executing complex tasks with the CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The code, models, audio, and Gaokao evaluation set can be accessed at aka.ms/wavllm.
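
The abstract names a prompt-aware LoRA weight adapter but does not describe its mechanism. The sketch below is one plausible reading, not the authors' implementation: a frozen linear layer receives a low-rank (LoRA) update whose per-rank strength is gated by an embedding of the prompt. The class name `PromptAwareLoRALinear`, the `gate` matrix, the sigmoid gating, and all dimensions are assumptions made for illustration only.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class PromptAwareLoRALinear:
    """Frozen linear layer plus a low-rank update scaled by the prompt.

    Hypothetical illustration: the gating scheme is an assumption,
    not taken from the WavLLM abstract.
    """

    def __init__(self, d_in, d_out, rank, d_prompt, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)          # frozen base weight
        self.A = rand(rank, d_in)           # LoRA down-projection
        # LoRA up-projection, zero-initialised so the update starts at zero
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.gate = rand(rank, d_prompt)    # assumed prompt-to-scale projection

    def forward(self, x, prompt_embedding):
        base = matvec(self.W, x)
        low_rank = matvec(self.A, x)
        # One scale in (0, 1) per LoRA rank, computed from the prompt embedding.
        scales = [sigmoid(g) for g in matvec(self.gate, prompt_embedding)]
        scaled = [s * h for s, h in zip(scales, low_rank)]
        delta = matvec(self.B, scaled)
        return [b + d for b, d in zip(base, delta)]


layer = PromptAwareLoRALinear(d_in=4, d_out=3, rank=2, d_prompt=5)
x = [1.0, -0.5, 0.25, 2.0]
prompt_a = [1.0, 0.0, 0.0, 0.0, 0.0]   # stand-in embedding for one instruction
prompt_b = [0.0, 0.0, 0.0, 0.0, 5.0]   # stand-in embedding for another
out_before_training = layer.forward(x, prompt_a)
```

With `B` still at zero, the layer reproduces the frozen base output regardless of the prompt. Once `B` is trained, the same input `x` yields different outputs for different prompt embeddings, which is the kind of instruction-dependent adaptation the abstract attributes to the adapter.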
