AudioPaLM: A Large Language Model That Can Speak and Listen
June 22, 2023
Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
cs.AI
Abstract
We introduce AudioPaLM, a large language model for speech understanding and
generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2
[Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified
multimodal architecture that can process and generate text and speech with
applications including speech recognition and speech-to-speech translation.
AudioPaLM inherits the capability to preserve paralinguistic information such
as speaker identity and intonation from AudioLM and the linguistic knowledge
present only in text large language models such as PaLM-2. We demonstrate that
initializing AudioPaLM with the weights of a text-only large language model
improves speech processing, successfully leveraging the larger quantity of text
training data used in pretraining to assist with the speech tasks. The
resulting model significantly outperforms existing systems for speech
translation tasks and has the ability to perform zero-shot speech-to-text
translation for many languages for which input/target language combinations
were not seen in training. AudioPaLM also demonstrates features of audio
language models, such as transferring a voice across languages based on a short
spoken prompt. We release examples of our method at
https://google-research.github.io/seanet/audiopalm/examples
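
The abstract says the model is initialized from the weights of a text-only language model and then handles both text and speech, but it does not spell out the mechanism. One plausible way to picture this is a joint vocabulary in which discrete audio tokens are appended after the text tokens, with the pretrained text embedding rows reused unchanged. The sketch below is only an illustration under that assumption: the function names `extend_embeddings` and `audio_token_id`, the toy sizes, and the random initialization of the new rows are all hypothetical and are not taken from the paper.

```python
import random


def extend_embeddings(text_embeddings, num_audio_tokens, seed=0):
    """Return a new table: pretrained text rows followed by fresh audio rows."""
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    # Copy the pretrained rows so the text-only knowledge is kept unchanged.
    extended = [list(row) for row in text_embeddings]
    # New audio-token rows get small random values (an assumed init scheme).
    for _ in range(num_audio_tokens):
        extended.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return extended


def audio_token_id(audio_code, text_vocab_size):
    """Map a discrete audio code to its ID in the joint vocabulary."""
    return text_vocab_size + audio_code


# Toy "pretrained" text table: 4 tokens, 3 dimensions each.
text_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
joint_table = extend_embeddings(text_table, num_audio_tokens=2)

# A mixed sequence: two text tokens followed by two audio codes.
mixed_ids = [0, 3] + [audio_token_id(c, len(text_table)) for c in (0, 1)]
```

With this layout a single decoder can consume one sequence of integer IDs regardless of modality, which is one way the text pretraining could carry over to speech tasks as the abstract describes.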