Audiobox: Unified Audio Generation with Natural Language Prompts

Abstract

Audio is an essential part of our lives, but creating it often requires expertise and is time-consuming. Over the past year, research communities have made great progress in advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several respects: speech generation models cannot synthesize novel styles from a text description and are limited in domain coverage, such as outdoor environments; sound generation models provide only coarse-grained control through descriptions like "a person speaking" and generate only mumbling human voices. This paper presents Audiobox, a unified model based on flow matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and to unify the speech and sound generation paradigms. When generating speech, we allow the transcript, the vocal style, and other audio styles to be controlled independently. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/
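The abstract names flow matching and an ODE solver without spelling out either, so a small sketch may help readers who are new to the idea. It assumes the optimal-transport conditional path commonly used in the flow matching literature; the abstract does not state which path Audiobox uses. The function names, the `sigma_min` parameter, and the toy velocity field are illustrative only. A plain fixed-step Euler integrator stands in for the "default ODE solver", and Bespoke Solvers are not reproduced here.

```python
import numpy as np


def ot_path(x0, x1, t, sigma_min=1e-5):
    """Point at time t on the straight-line path from noise x0 toward data x1.

    Assumed form: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1.
    """
    return (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1


def ot_target_velocity(x0, x1, sigma_min=1e-5):
    """Time derivative of ot_path; this is the regression target in training.

    For this path the derivative does not depend on t.
    """
    return x1 - (1.0 - sigma_min) * x0


def euler_sample(velocity_fn, x0, num_steps=32):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    velocity_fn is a stand-in for a trained network; each step costs one call.
    """
    x = np.array(x0, dtype=float)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        x = x + h * velocity_fn(x, t)
    return x


# Toy check (hypothetical setup): if the "dataset" is a single point and
# sigma_min = 0, the exact velocity field is (target - x) / (1 - t), so
# integrating from any noise sample should land on the target.
target = np.array([0.5, -1.0, 2.0])
noise = np.random.default_rng(0).standard_normal(3)
sample = euler_sample(lambda x, t: (target - x) / (1.0 - t), noise)
```

In this sketch the cost of generation is one call to `velocity_fn` per step, so the number of steps sets the speed. That is the quantity a faster solver would try to reduce while keeping the end point of the trajectory close to what a fine-grained integration would give.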
