Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Abstract

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically handled separately by specialized models, and a truly unified framework that integrates all three tasks remains underexplored. Some pioneering works have attempted to unify audio understanding and generation, but they tend to remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across the general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome data scarcity, a critical obstacle in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while performing on par with or better than specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, pointing to a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released at https://zeyuet.github.io/Audio-Omni.

