Whisper-GPT: A Hybrid Representation Audio Large Language Model

Abstract

We propose WHISPER-GPT, a generative large language model (LLM) for speech and music that works with continuous audio representations and discrete tokens simultaneously within a single architecture. There has been a huge surge in generative audio, speech, and music models that use discrete audio tokens derived from neural compression algorithms such as ENCODEC. However, a major drawback of this approach is context length: it blows up for high-fidelity generative architectures, because the model must account for the audio content at all frequencies when predicting the next token. By combining a continuous audio representation, such as the spectrogram, with discrete acoustic tokens, we retain the best of both worlds: all the information needed from the audio at a specific time instance is held in a single token, yet the LLM still predicts future tokens, which permits sampling and the other benefits a discrete space provides. We show that our architecture improves perplexity and negative log-likelihood for next-token prediction compared to a token-based LLM for speech and music.
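The two evaluation metrics named above are closely related: perplexity is the exponential of the mean negative log-likelihood (NLL) of the correct next tokens. The following is a minimal sketch of that relationship only, not the paper's evaluation code. The function names and the example probabilities are hypothetical, and the codebook size of 1024 is an assumption chosen for illustration.

```python
import math


def mean_nll(correct_token_probs):
    """Mean negative log-likelihood (in nats) over a sequence.

    `correct_token_probs` holds the probability the model assigned to the
    token that actually came next at each step (hypothetical input format).
    """
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def perplexity(correct_token_probs):
    """Perplexity is exp(mean NLL); lower is better."""
    return math.exp(mean_nll(correct_token_probs))


# Assumed vocabulary size for illustration: 1024 acoustic tokens.
# A model that guesses uniformly has perplexity equal to the vocabulary size.
uniform_guess = [1.0 / 1024] * 4

# A model that assigns more mass to the correct token scores lower on both metrics.
better_model = [0.25, 0.5, 0.125, 0.25]

print(mean_nll(uniform_guess), perplexity(uniform_guess))
print(mean_nll(better_model), perplexity(better_model))
```

Because the mapping from NLL to perplexity is monotonic, an improvement in one metric always implies an improvement in the other; reporting both is a matter of readability.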

