AudioX: Diffusion Transformer for Anything-to-Audio Generation

Abstract

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/.
