Audiobox: Unified Audio Generation with Natural Language Prompts

Abstract

Audio is an essential part of our lives, but creating it often requires expertise and is time-consuming. Over the past year, research communities have made great progress in advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several respects: speech generation models cannot synthesize novel styles based on text descriptions and are limited in domain coverage, such as outdoor environments; sound generation models provide only coarse-grained control based on descriptions like "a person speaking" and generate only mumbling human voices. This paper presents Audiobox, a unified model based on flow matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify the speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/
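
To give a rough sense of why the choice of ODE solver matters for flow-matching generation, the toy sketch below integrates a one-dimensional velocity field from t = 0 to t = 1 with two generic solvers and counts velocity evaluations, since each evaluation would correspond to one forward pass of a trained model. Everything here is an assumption made for illustration: `velocity` is a hypothetical closed-form stand-in for a learned network, and the Euler and midpoint methods are textbook solvers, not Audiobox's default solver and not Bespoke Solvers.

```python
import math


def velocity(x, t, target=2.0, rate=1.0):
    # Hypothetical stand-in for a trained flow-matching network v_theta(x, t).
    # This toy field pulls x toward `target`; it happens not to depend on t.
    return rate * (target - x)


def euler_sample(x0, num_steps):
    # First-order solver: one velocity evaluation per step.
    x, h = x0, 1.0 / num_steps
    for k in range(num_steps):
        x += h * velocity(x, k * h)
    return x, num_steps


def midpoint_sample(x0, num_steps):
    # Second-order solver: two velocity evaluations per step.
    x, h = x0, 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        x_mid = x + 0.5 * h * velocity(x, t)
        x += h * velocity(x_mid, t + 0.5 * h)
    return x, 2 * num_steps


def exact_endpoint(x0, target=2.0, rate=1.0):
    # Closed-form solution of dx/dt = rate * (target - x) evaluated at t = 1.
    return target + (x0 - target) * math.exp(-rate)


if __name__ == "__main__":
    x0 = 0.0
    exact = exact_endpoint(x0)
    for name, sampler, steps in [("euler", euler_sample, 8),
                                 ("midpoint", midpoint_sample, 4)]:
        x1, evals = sampler(x0, steps)
        print(f"{name}: evaluations={evals}, abs error={abs(x1 - exact):.5f}")
```

With the same budget of eight velocity evaluations, the midpoint solver lands closer to the exact endpoint than Euler on this toy problem. That is the general trade-off a faster solver exploits: fewer model evaluations for comparable sample quality.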
