Audiobox: Unified Audio Generation with Natural Language Prompts
December 25, 2023
Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu
cs.AI
Abstract
Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on a text description and have limited domain coverage (e.g., outdoor environments); sound generation models provide only coarse-grained control based on descriptions like "a person speaking" and would generate only mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify the speech and sound generation paradigms. When generating speech, we allow the transcript, vocal style, and other audio styles to be controlled independently. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by more than 25 times compared with the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/.
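As background for the "flow-matching" framing in the abstract, the sketch below states the conditional flow-matching objective of Lipman et al. (2023) with the optimal-transport probability path, the general recipe that models in this family train with. It is an illustrative statement of the technique, not a verbatim reproduction of Audiobox's loss; the value of σ_min and the conditioning inputs are details specified in the paper itself.

```latex
% Conditional flow matching with the optimal-transport path
% (Lipman et al., 2023). A network v_t(x; \theta) is regressed onto
% the vector field that transports noise x_0 ~ N(0, I) to data x_1 ~ q,
% where the interpolant is x_t = (1 - (1 - \sigma_{\min}) t) x_0 + t x_1:
\mathcal{L}_{\mathrm{CFM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q,\; x_0 \sim \mathcal{N}(0, I)}
  \bigl\| v_t(x_t; \theta) - \bigl( x_1 - (1 - \sigma_{\min})\, x_0 \bigr) \bigr\|^2
```

Training thus reduces to a simple regression over random interpolation times t. In Voicebox-style infilling pre-training, the same regression is applied with part of the feature sequence masked and the loss computed on the masked frames, which is what lets large quantities of unlabeled audio be used directly.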
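On the reported 25x speedup from Bespoke Solvers: sampling from a flow-matching model means numerically integrating the learned ODE dx/dt = v_t(x) from noise at t = 0 to audio features at t = 1, so wall-clock cost is roughly proportional to the number of solver steps (one network evaluation each). The minimal sketch below is ours, not from the paper (the name `sample_flow` and the fixed-step Euler choice are illustrative assumptions); it shows why a solver that reaches comparable quality in far fewer steps yields a near-proportional speedup.

```python
import torch

@torch.no_grad()
def sample_flow(v_theta, x0, num_steps=64):
    """Fixed-step Euler integration of dx/dt = v_theta(x, t) over t in [0, 1].

    Each step costs one forward pass of the network, so generation time
    scales roughly linearly with num_steps; a bespoke solver that matches
    quality with only a handful of steps cuts cost accordingly.
    """
    x = x0  # x0 ~ N(0, I), same shape as the target audio features
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * v_theta(x, t)  # one model evaluation per step
    return x
```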