Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
April 12, 2026
Authors: Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo
cs.AI
Abstract
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, and a truly unified framework that seamlessly integrates all three tasks remains underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech, with integrated multimodal understanding. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while matching or surpassing specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited abilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released at https://zeyuet.github.io/Audio-Omni.
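To make the two-stage design concrete, the sketch below illustrates in PyTorch how a frozen reasoning backbone can condition a trainable diffusion transformer via cross-attention. This is a minimal hypothetical illustration only: all module names (FrozenMLLM, DiTBlock, AudioDiT), dimensions, and the conditioning scheme are our assumptions, not the paper's released implementation.

```python
# Minimal sketch of the frozen-MLLM + trainable-DiT composition described
# in the abstract. Module names, shapes, and the cross-attention conditioning
# scheme are illustrative assumptions, not Audio-Omni's actual code.
import torch
import torch.nn as nn

class FrozenMLLM(nn.Module):
    """Stand-in for a pretrained multimodal LLM; its weights stay frozen."""
    def __init__(self, dim=512):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        for p in self.parameters():  # freeze: only the DiT is trained
            p.requires_grad = False

    def forward(self, token_embeds):
        return self.encoder(token_embeds)  # high-level condition embeddings

class DiTBlock(nn.Module):
    """One diffusion-transformer block with cross-attention to MLLM output."""
    def __init__(self, dim=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        x = x + self.self_attn(self.n1(x), self.n1(x), self.n1(x))[0]
        x = x + self.cross_attn(self.n2(x), cond, cond)[0]  # inject reasoning
        return x + self.mlp(self.n3(x))

class AudioDiT(nn.Module):
    """Trainable DiT that denoises latent audio frames given MLLM conditions."""
    def __init__(self, latent_dim=64, dim=512, depth=4):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, dim)
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                     nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(DiTBlock(dim) for _ in range(depth))
        self.proj_out = nn.Linear(dim, latent_dim)

    def forward(self, noisy_latents, t, cond):
        h = self.proj_in(noisy_latents) + self.t_embed(t[:, None, None].float())
        for blk in self.blocks:
            h = blk(h, cond)
        return self.proj_out(h)  # predicted noise (epsilon-parameterization)

# Toy forward pass: batch of 2, 100 latent audio frames, 16 prompt tokens.
mllm, dit = FrozenMLLM(), AudioDiT()
cond = mllm(torch.randn(2, 16, 512))            # frozen reasoning backbone
noise_pred = dit(torch.randn(2, 100, 64),       # noisy audio latents
                 torch.randint(0, 1000, (2,)),  # diffusion timesteps
                 cond)
print(noise_pred.shape)  # torch.Size([2, 100, 64])
```

The design choice mirrored here is that gradients flow only through the DiT, so the MLLM's reasoning ability is inherited rather than retrained; how Audio-Omni actually bridges the two modules is not specified in the abstract.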