SpeechVerse: A Large-scale Generalizable Audio Language Model
May 14, 2024
Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, David Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff
cs.AI
Abstract
Large language models (LLMs) have shown incredible proficiency in performing
tasks that require semantic understanding of natural language instructions.
Recently, many works have further expanded this capability to perceive
multimodal audio and text inputs, but their capabilities are often limited to
specific fine-tuned tasks such as automatic speech recognition and translation.
We therefore develop SpeechVerse, a robust multi-task training and curriculum
learning framework that combines pre-trained speech and text foundation models
via a small set of learnable parameters, while keeping the pre-trained models
frozen during training. The models are instruction finetuned using continuous
latent representations extracted from the speech foundation model to achieve
optimal zero-shot performance on a diverse range of speech processing tasks
using natural language instructions. We perform extensive benchmarking that
includes comparing our model performance against traditional baselines across
several datasets and tasks. Furthermore, we evaluate the model's capability for
generalized instruction following by testing on out-of-domain datasets, novel
prompts, and unseen tasks. Our empirical experiments reveal that our multi-task
SpeechVerse model even outperforms conventional task-specific baselines on 9
out of 11 tasks.
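
The abstract describes the architecture only at a high level: a frozen speech foundation model and a frozen text LLM joined by a small set of learnable parameters, with continuous speech latents fed to the LLM alongside a natural-language instruction. The following is a minimal, hypothetical PyTorch sketch of that general pattern; the class and parameter names (SpeechAdapterLM, speech_dim, llm_dim) and the two-layer adapter are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SpeechAdapterLM(nn.Module):
    """Illustrative composition: frozen speech encoder + frozen LLM,
    bridged by a small trainable projection (the only learnable part)."""

    def __init__(self, speech_encoder: nn.Module, llm: nn.Module,
                 speech_dim: int, llm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.llm = llm
        # Freeze both foundation models; only the adapter below is trained.
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # Small learnable adapter mapping continuous speech latents
        # into the LLM embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor,
                instruction_embeds: torch.Tensor) -> torch.Tensor:
        # Continuous latent representations from the frozen speech model.
        with torch.no_grad():
            speech_latents = self.speech_encoder(audio_features)
        # Project speech latents into the LLM input space and prepend them
        # to the embedded natural-language instruction tokens.
        speech_prefix = self.adapter(speech_latents)
        llm_inputs = torch.cat([speech_prefix, instruction_embeds], dim=1)
        return self.llm(llm_inputs)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end (not real foundation models).
    encoder = nn.Linear(80, 512)       # pretend speech encoder: 80-dim features -> 512
    llm = nn.Linear(1024, 1024)        # pretend decoder-only LLM body
    model = SpeechAdapterLM(encoder, llm, speech_dim=512, llm_dim=1024)

    audio = torch.randn(2, 50, 80)     # (batch, frames, features)
    prompt = torch.randn(2, 12, 1024)  # embedded instruction tokens
    out = model(audio, prompt)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(out.shape, trainable)        # only adapter parameters are trainable
```

Because both foundation models stay frozen, only the adapter's parameters receive gradients, which keeps the trainable footprint small relative to the full model, consistent with the parameter-efficient setup the abstract describes.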