AudioX: Diffusion Transformer for Anything-to-Audio Generation
March 13, 2025
Authors: Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
cs.AI
Abstract
Audio and music generation have emerged as crucial tasks in many
applications, yet existing approaches face significant limitations: they
operate in isolation without unified capabilities across modalities, suffer
from scarce high-quality, multi-modal training data, and struggle to
effectively integrate diverse inputs. In this work, we propose AudioX, a
unified Diffusion Transformer model for Anything-to-Audio and Music Generation.
Unlike previous domain-specific models, AudioX can generate both general audio
and music with high quality, while offering flexible natural language control
and seamless processing of various modalities including text, video, image,
music, and audio. Its key innovation is a multi-modal masked training strategy
that masks inputs across modalities and forces the model to learn from masked
inputs, yielding robust and unified cross-modal representations. To address
data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K
audio captions based on the VGGSound dataset, and V2M-caps with 6 million music
captions derived from the V2M dataset. Extensive experiments demonstrate that
AudioX not only matches or outperforms state-of-the-art specialized models, but
also offers remarkable versatility in handling diverse input modalities and
generation tasks within a unified architecture. The code and datasets will be
available at https://zeyuet.github.io/AudioX/.
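
The multi-modal masked training strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the masking granularity, the masking probability, or the placeholder value, so everything here is an assumption for illustration only: the function name `mask_modalities`, the `mask_prob` parameter, the choice to mask whole modalities with zero vectors, and the rule that at least one modality stays visible are all hypothetical and are not taken from the AudioX code.

```python
import random
from typing import Dict, List, Optional, Tuple

# A modality is represented as a sequence of feature vectors (assumption).
Embedding = List[List[float]]


def mask_modalities(
    inputs: Dict[str, Embedding],
    mask_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, Embedding], Dict[str, bool]]:
    """Randomly replace whole modalities with zero placeholders.

    Hypothetical illustration of cross-modal masking; not the authors' method.
    Returns the masked inputs and a flag per modality saying whether it was masked.
    """
    rng = rng or random.Random()
    names = sorted(inputs)

    # Independently decide, per modality, whether to hide it for this step.
    masked = {name: rng.random() < mask_prob for name in names}

    # Assumption: keep at least one conditioning modality visible.
    if names and all(masked.values()):
        masked[rng.choice(names)] = False

    out: Dict[str, Embedding] = {}
    for name in names:
        if masked[name]:
            # Same shape as the original, but all zeros.
            out[name] = [[0.0] * len(vec) for vec in inputs[name]]
        else:
            # Copy so the caller's data is never mutated.
            out[name] = [list(vec) for vec in inputs[name]]
    return out, masked
```

In a training loop, a function like this would be called once per batch before the conditioning inputs reach the model, so that the model sees a different subset of modalities at each step.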