Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
November 12, 2023
Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
cs.AI
Abstract
In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of LLM capabilities, without using any carefully curated paired data. The proposed model can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform speech question answering, speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. Experiments show that our end-to-end approach is on par with or outperforms a cascaded system (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike a cascade, our approach shows the ability to interchange text and audio modalities and utilize the prior context in a conversation to provide better results.
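
To make the contrast between the two setups concrete, below is a minimal, self-contained PyTorch sketch of where a cascaded pipeline and an end-to-end audio prompt differ at the interface to the LLM. It is not the paper's implementation: the ToyLLM and AudioPrefixEncoder modules, the dimensions, and the stand-in ASR transcript are all illustrative assumptions, and the real system would use an actual recognizer, an actual Llama-2 backbone, and a trained audio encoder.

```python
# Illustrative sketch only (not the authors' code): cascade vs. end-to-end
# audio prompting. All module names, sizes, and stand-ins are assumptions.
import torch
import torch.nn as nn


class ToyLLM(nn.Module):
    """Stand-in language model that accepts precomputed input embeddings."""

    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, inputs_embeds):
        return self.lm_head(self.backbone(inputs_embeds))


class AudioPrefixEncoder(nn.Module):
    """Maps an audio feature sequence into the LLM embedding space and
    downsamples it so it can be prepended to the text embeddings."""

    def __init__(self, n_mels=80, d_model=64, stride=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=stride, stride=stride),
            nn.GELU(),
        )

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        x = self.proj(mel.transpose(1, 2))  # -> (batch, d_model, frames/stride)
        return x.transpose(1, 2)            # -> (batch, frames/stride, d_model)


llm = ToyLLM()
audio_enc = AudioPrefixEncoder()

mel = torch.randn(1, 160, 80)                  # 160 frames of 80-dim features
text_prompt = torch.randint(0, 1000, (1, 12))  # 12 text tokens (e.g. a question)

# Cascade: an external recognizer produces a transcript, which the LLM sees as
# ordinary text; any audio information beyond the transcript is lost here.
transcript_tokens = torch.randint(0, 1000, (1, 20))  # pretend ASR output
cascade_inputs = llm.embed(torch.cat([transcript_tokens, text_prompt], dim=1))
cascade_logits = llm(cascade_inputs)

# End-to-end: the audio is mapped to continuous prompt embeddings and
# prepended to the text embeddings, so the LLM consumes the audio directly.
audio_prefix = audio_enc(mel)                          # (1, 40, 64)
e2e_inputs = torch.cat([audio_prefix, llm.embed(text_prompt)], dim=1)
e2e_logits = llm(e2e_inputs)

print(cascade_logits.shape, e2e_logits.shape)
```

In this reading, the key difference is the interface: the cascade commits to a discrete transcript before the LLM ever sees the utterance, whereas the end-to-end path hands the LLM a continuous audio-derived prefix, which is what allows audio and text prompts to be interchanged and conversational context to be carried across modalities.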