

AudioX: Diffusion Transformer for Anything-to-Audio Generation

March 13, 2025
作者: Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
cs.AI

Abstract

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/
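The multi-modal masked training strategy described above can be sketched as follows. This is a minimal illustration, not the AudioX implementation: it assumes conditioning inputs are keyed by modality name in a dict, whereas the actual model operates on embedded feature tensors inside a Diffusion Transformer. The function name `mask_multimodal_inputs` and its parameters are hypothetical.

```python
import random

def mask_multimodal_inputs(inputs, mask_prob=0.5, mask_token=None, rng=None):
    """Randomly mask entire modalities so the model must learn to
    generate audio from whatever partial conditioning remains.

    `inputs` maps modality names (e.g. "text", "video", "audio")
    to their features. At least one modality is always kept, so the
    model never trains on a fully empty conditioning set.
    """
    rng = rng or random.Random()
    masked = {name: (mask_token if rng.random() < mask_prob else feat)
              for name, feat in inputs.items()}
    # Guarantee at least one modality survives masking.
    if all(feat is mask_token for feat in masked.values()):
        keep = rng.choice(list(inputs))
        masked[keep] = inputs[keep]
    return masked
```

During training, each batch would pass through such a masking step before conditioning the diffusion model, which is what encourages the robust, unified cross-modal representations the abstract describes.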

