Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
November 12, 2023
Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
cs.AI
Abstract
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization amongst many other closed and open-domain
tasks. This is unlike prior approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach shows the ability to interchange
text and audio modalities and utilize the prior context in a conversation to
provide better results.