Audiobox: Unified Audio Generation with Natural Language Prompts
December 25, 2023
Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu
cs.AI
Abstract
Audio is an essential part of our life, but creating it often requires
expertise and is time-consuming. The research community has made great
progress over the past year in advancing the performance of large-scale audio
generative models for a single modality (speech, sound, or music) by adopting more
powerful generative models and scaling data. However, these models lack
controllability in several aspects: speech generation models cannot synthesize
novel styles based on text descriptions and are limited in domain coverage, such
as outdoor environments; sound generation models provide only coarse-grained
control based on descriptions like "a person speaking" and generate only
mumbling human voices. This paper presents Audiobox, a unified model based on
flow-matching that is capable of generating various audio modalities. We design
description-based and example-based prompting to enhance controllability and
unify speech and sound generation paradigms. We allow transcript, vocal, and
other audio styles to be controlled independently when generating speech. To
improve model generalization with limited labels, we adapt a self-supervised
infilling objective to pre-train on large quantities of unlabeled audio.
Audiobox sets new benchmarks on speech and sound generation (0.745 similarity
on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and
unlocks new methods for generating audio with novel vocal and acoustic styles.
We further integrate Bespoke Solvers, which speed up generation by over 25
times compared to the default ODE solver for flow-matching, without loss of
performance on several tasks. Our demo is available at
https://audiobox.metademolab.com/.
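
For readers unfamiliar with flow-matching generation, the following is a minimal, self-contained sketch of conditional flow matching with an optimal-transport probability path and fixed-step Euler ODE sampling, in the spirit of the approach the abstract describes. The `VelocityField` network, layer sizes, `sigma_min`, and step count are illustrative assumptions, not Audiobox's actual architecture or solver configuration.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Hypothetical stand-in for the model's velocity network; Audiobox's
    real network is a Transformer over audio features, not shown here."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on time by concatenating t to the input features.
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

def cfm_loss(model: VelocityField, x1: torch.Tensor, sigma_min: float = 1e-4):
    """Conditional flow-matching loss with an optimal-transport path:
    regress the network onto the constant velocity carrying noise x0
    to data x1 along a straight line."""
    x0 = torch.randn_like(x1)                      # noise endpoint (t = 0)
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # point on the path
    target = x1 - (1 - sigma_min) * x0             # target velocity
    return ((model(xt, t) - target) ** 2).mean()

@torch.no_grad()
def sample(model: VelocityField, dim: int, steps: int = 32) -> torch.Tensor:
    """Generate by integrating dx/dt = v(x, t) from noise (t = 0) to data
    (t = 1) with fixed-step Euler. Bespoke Solvers, per the abstract, replace
    this generic step rule to cut the number of function evaluations."""
    x = torch.randn(1, dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        x = x + dt * model(x, t)
    return x
```

A training step would simply call `cfm_loss(model, batch).backward()` and update the parameters; per the abstract, the 25x generation speedup comes from swapping the generic ODE integration in `sample` for a Bespoke Solver, not from changing the trained model.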