Audiobox: Unified Audio Generation with Natural Language Prompts
December 25, 2023
Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu
cs.AI
Abstract
Audio is an essential part of our life, but creating it often requires
expertise and is time-consuming. The research community has made great
progress over the past year in advancing the performance of large-scale audio
generative models for a single modality (speech, sound, or music) by adopting more
powerful generative models and scaling data. However, these models lack
controllability in several aspects: speech generation models cannot synthesize
novel styles based on text descriptions and are limited in domain coverage, such
as outdoor environments; sound generation models provide only coarse-grained
control based on descriptions like "a person speaking" and generate only
mumbling human voices. This paper presents Audiobox, a unified model based on
flow-matching that is capable of generating various audio modalities. We design
description-based and example-based prompting to enhance controllability and
unify speech and sound generation paradigms. We allow transcript, vocal, and
other audio styles to be controlled independently when generating speech. To
improve model generalization with limited labels, we adapt a self-supervised
infilling objective to pre-train on large quantities of unlabeled audio.
Audiobox sets new benchmarks on speech and sound generation (0.745 similarity
on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and
unlocks new methods for generating audio with novel vocal and acoustic styles.
We further integrate Bespoke Solvers, which speed up generation by over 25
times compared to the default ODE solver for flow-matching, without loss of
performance on several tasks. Our demo is available at
https://audiobox.metademolab.com/.
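
For readers unfamiliar with flow-matching generation, the following is a minimal, self-contained sketch of conditional flow matching with an optimal-transport probability path and fixed-step Euler ODE sampling, in the spirit of the approach the abstract describes. The `VelocityField` network, layer sizes, `sigma_min`, and step count are illustrative assumptions, not Audiobox's actual architecture or solver configuration.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Hypothetical stand-in for the model's velocity network; Audiobox's
    real network is a Transformer over audio features, not shown here."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on time by concatenating t to the input features.
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

def cfm_loss(model: VelocityField, x1: torch.Tensor, sigma_min: float = 1e-4):
    """Conditional flow-matching loss with an optimal-transport path:
    regress the network onto the constant velocity carrying noise x0
    to data x1 along a straight line."""
    x0 = torch.randn_like(x1)                      # noise endpoint (t = 0)
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # point on the path
    target = x1 - (1 - sigma_min) * x0             # target velocity
    return ((model(xt, t) - target) ** 2).mean()

@torch.no_grad()
def sample(model: VelocityField, dim: int, steps: int = 32) -> torch.Tensor:
    """Generate by integrating dx/dt = v(x, t) from noise (t = 0) to data
    (t = 1) with fixed-step Euler. Bespoke Solvers, per the abstract, replace
    this generic step rule to cut the number of function evaluations."""
    x = torch.randn(1, dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        x = x + dt * model(x, t)
    return x
```

A training step would simply call `cfm_loss(model, batch).backward()` and update the parameters; per the abstract, the 25x generation speedup comes from swapping the generic ODE integration in `sample` for a Bespoke Solver, not from changing the trained model.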