

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

April 12, 2026
Authors: Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo
cs.AI

Abstract

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving unified frameworks that seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across the general sound, music, and speech domains, with integrated multimodal understanding. Our architecture couples a frozen Multimodal Large Language Model (MLLM) for high-level reasoning with a trainable Diffusion Transformer (DiT) for high-fidelity synthesis. To overcome the critical scarcity of audio-editing data, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches and matching or surpassing specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited abilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, pointing toward universal generative audio intelligence. The code, model, and dataset will be publicly released at https://zeyuet.github.io/Audio-Omni.
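
The abstract's central architectural claim, a frozen reasoning backbone driving a trainable diffusion denoiser, can be made concrete with a small sketch. The PyTorch code below is purely illustrative and not the authors' implementation: the module names (CrossAttnDiTBlock, FrozenReasonerDiT), the dimensions, and the cross-attention conditioning path are all assumptions about one common way such a split can be wired.

```python
# A minimal sketch, assuming a standard cross-attention conditioning scheme;
# this is NOT the released Audio-Omni code. A frozen "reasoner" (standing in
# for the multimodal LLM) produces hidden states that a small, trainable
# diffusion-transformer denoiser attends to while denoising audio latents.
import torch
import torch.nn as nn


class CrossAttnDiTBlock(nn.Module):
    """One DiT block: self-attention over audio latents, then cross-attention
    into the frozen backbone's states, then a feed-forward MLP."""

    def __init__(self, dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


class FrozenReasonerDiT(nn.Module):
    """Hypothetical wrapper: freeze the reasoning backbone, train only the DiT."""

    def __init__(self, reasoner: nn.Module, cond_dim: int,
                 latent_dim: int = 64, dim: int = 512, depth: int = 4):
        super().__init__()
        self.reasoner = reasoner.eval()
        for p in self.reasoner.parameters():  # high-level reasoning stays frozen
            p.requires_grad_(False)
        self.in_proj = nn.Linear(latent_dim, dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(
            CrossAttnDiTBlock(dim, cond_dim) for _ in range(depth))
        self.out_proj = nn.Linear(dim, latent_dim)

    def forward(self, noisy_latents, t, prompt_embeds):
        with torch.no_grad():  # conditioning states come from the frozen model
            cond = self.reasoner(prompt_embeds)
        x = self.in_proj(noisy_latents) + self.time_mlp(t[:, None])[:, None, :]
        for blk in self.blocks:
            x = blk(x, cond)
        return self.out_proj(x)  # predicted noise (or velocity) over latents


# Toy usage with a stand-in backbone; all shapes are illustrative.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2)
model = FrozenReasonerDiT(backbone, cond_dim=768)
noise_pred = model(torch.randn(2, 100, 64),   # (batch, frames, latent_dim)
                   torch.rand(2),             # diffusion timestep in [0, 1)
                   torch.randn(2, 16, 768))   # (batch, prompt_len, cond_dim)
print(noise_pred.shape)  # torch.Size([2, 100, 64])
```

Under this reading, only the DiT's parameters receive gradients, which is what lets the system inherit the backbone's reasoning and in-context abilities without retraining it; how the paper actually bridges the two components may differ from this sketch.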