AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

August 10, 2023
Authors: Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
cs.AI

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches. Our demo and code are available at https://audioldm.github.io/audioldm2.
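The abstract describes a two-stage pipeline: a GPT-2 model maps the conditioning input (e.g., text) into the LOA feature space defined by AudioMAE, and a latent diffusion model then generates audio conditioned on that LOA sequence. The following is a minimal PyTorch sketch of that data flow only; the class names, dimensions, and single-step denoiser are hypothetical stand-ins, not the authors' implementation (see the linked code for the real one).

```python
# Hypothetical sketch of the AudioLDM 2 data flow: condition -> LOA -> latent.
# All modules and shapes are illustrative stand-ins.
import torch
import torch.nn as nn

class TextToLOA(nn.Module):
    """Stand-in for the GPT-2 stage: maps conditioning embeddings
    into the LOA (AudioMAE feature) space."""
    def __init__(self, dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, cond_emb):          # (B, T_cond, dim)
        return self.decoder(cond_emb)     # LOA token sequence

class LOAConditionedLDM(nn.Module):
    """Stand-in for the latent diffusion model: one denoising step
    on an audio latent, conditioned on the LOA sequence."""
    def __init__(self, latent_dim=64, loa_dim=768):
        super().__init__()
        self.cond_proj = nn.Linear(loa_dim, latent_dim)
        self.denoise = nn.Sequential(
            nn.Linear(latent_dim * 2, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, noisy_latent, loa):  # (B, L, latent_dim), (B, T, loa_dim)
        cond = self.cond_proj(loa).mean(dim=1, keepdim=True)  # pool LOA condition
        cond = cond.expand_as(noisy_latent)
        return self.denoise(torch.cat([noisy_latent, cond], dim=-1))

# Toy forward pass: text condition -> LOA -> denoised audio latent.
text_emb = torch.randn(1, 16, 768)            # pretend text-encoder output
loa = TextToLOA()(text_emb)                   # GPT-2 stage (stand-in)
noisy = torch.randn(1, 256, 64)               # noisy audio latent
audio_latent = LOAConditionedLDM()(noisy, loa)
print(audio_latent.shape)                     # torch.Size([1, 256, 64])
```

Because the diffusion stage is conditioned only on LOA, any modality that can be translated into LOA reuses the same pretrained AudioMAE and latent diffusion components, which is the modularity the abstract highlights.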