
AudioPaLM: A Large Language Model That Can Speak and Listen

June 22, 2023
Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
cs.AI

Abstract

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
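
The abstract describes fusing a text-only LLM with AudioLM-style speech representations in a single multimodal model that reads and writes mixed text and audio tokens, with the text side initialized from pretrained text-only weights. The following is a minimal, hypothetical PyTorch sketch of that general idea only; the vocabulary sizes, layer counts, variable names, and the use of nn.TransformerDecoder as a decoder-only stand-in are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the AudioPaLM code): extend a text decoder's
# vocabulary with discrete audio tokens so one model handles both modalities.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # pretrained text tokenizer size (assumed)
AUDIO_VOCAB = 1_024   # discrete audio codes from an AudioLM-style tokenizer (assumed)
D_MODEL = 512         # toy embedding width

# 1) Stand-in for the pretrained text-only LM embeddings (e.g. a PaLM-2-like model).
pretrained_text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)

# 2) Combined embedding table: text rows keep their pretrained vectors,
#    the new audio-token rows are freshly initialized.
combined_emb = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)
with torch.no_grad():
    combined_emb.weight[:TEXT_VOCAB] = pretrained_text_emb.weight

# 3) A single decoder consumes mixed sequences: text ids in [0, TEXT_VOCAB),
#    audio ids offset into [TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB).
decoder_layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + AUDIO_VOCAB)

# Example prompt: a text instruction followed by audio tokens (offset ids).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))
audio_ids = torch.randint(0, AUDIO_VOCAB, (1, 16)) + TEXT_VOCAB
mixed = torch.cat([text_ids, audio_ids], dim=1)

x = combined_emb(mixed)
causal_mask = nn.Transformer.generate_square_subsequent_mask(mixed.size(1))
hidden = decoder(tgt=x, memory=x, tgt_mask=causal_mask)  # decoder-only stand-in
logits = lm_head(hidden)  # next-token scores over the joint text+audio vocabulary
print(logits.shape)       # torch.Size([1, 24, 33024])
```

Because the output head scores both text and audio tokens, the same model can, in principle, be trained to emit a transcript (speech recognition), translated text, or translated speech tokens, which is the unified behavior the abstract claims.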