Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
November 12, 2023
Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
cs.AI
Abstract
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization amongst many other closed and open-domain
tasks. This is unlike prior approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach shows the ability to interchange
text and audio modalities and utilize the prior context in a conversation to
provide better results.