
AudioPaLM: A Large Language Model That Can Speak and Listen

June 22, 2023
Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
cs.AI

Abstract

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
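
The abstract describes fusing a text-only LLM with AudioLM-style speech representations in a single multimodal model that reads and writes mixed text and audio tokens, with the text side initialized from pretrained text-only weights. The following is a minimal, hypothetical PyTorch sketch of that general idea only; the vocabulary sizes, layer counts, variable names, and the use of nn.TransformerDecoder as a decoder-only stand-in are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the AudioPaLM code): extend a text decoder's
# vocabulary with discrete audio tokens so one model handles both modalities.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # pretrained text tokenizer size (assumed)
AUDIO_VOCAB = 1_024   # discrete audio codes from an AudioLM-style tokenizer (assumed)
D_MODEL = 512         # toy embedding width

# 1) Stand-in for the pretrained text-only LM embeddings (e.g. a PaLM-2-like model).
pretrained_text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)

# 2) Combined embedding table: text rows keep their pretrained vectors,
#    the new audio-token rows are freshly initialized.
combined_emb = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)
with torch.no_grad():
    combined_emb.weight[:TEXT_VOCAB] = pretrained_text_emb.weight

# 3) A single decoder consumes mixed sequences: text ids in [0, TEXT_VOCAB),
#    audio ids offset into [TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB).
decoder_layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + AUDIO_VOCAB)

# Example prompt: a text instruction followed by audio tokens (offset ids).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))
audio_ids = torch.randint(0, AUDIO_VOCAB, (1, 16)) + TEXT_VOCAB
mixed = torch.cat([text_ids, audio_ids], dim=1)

x = combined_emb(mixed)
causal_mask = nn.Transformer.generate_square_subsequent_mask(mixed.size(1))
hidden = decoder(tgt=x, memory=x, tgt_mask=causal_mask)  # decoder-only stand-in
logits = lm_head(hidden)  # next-token scores over the joint text+audio vocabulary
print(logits.shape)       # torch.Size([1, 24, 33024])
```

Because the output head scores both text and audio tokens, the same model can, in principle, be trained to emit a transcript (speech recognition), translated text, or translated speech tokens, which is the unified behavior the abstract claims.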