AudioPaLM: A Large Language Model That Can Speak and Listen

Abstract

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits from AudioLM the capability to preserve paralinguistic information such as speaker identity and intonation, and from text large language models such as PaLM-2 the linguistic knowledge present only in such models. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and can perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
