SpeechVerse: A Large-scale Generalizable Audio Language Model

Abstract

Large language models (LLMs) have shown remarkable proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have extended this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction fine-tuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.
