MMAE: A Massive Multitask Audio Editing Benchmark

Abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark, the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Driven by the shift toward intelligent creation, interactive editing has rapidly expanded from the visual domain, where it was pioneered by models such as Nano-banana 2 for images and Gemini-Omni for video, into audio. Evaluation infrastructure, however, lags far behind: existing benchmarks are fragmented and restricted to specific subdomains or basic operations. In contrast, MMAE covers a broad spectrum of real-world scenarios across 7 distinct audio modalities, including sound, speech, music, and their mixtures. We further establish a taxonomy spanning 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), 2 levels of granularity, and 8 distinct operation types. Curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this framework enables precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models shows that current systems remain far from achieving reliable edits: the Exact Match Rate (EMR) consistently falls below 5% and drops to 0% on complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
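
The rubric-based scoring mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define EMR formally, so the code below assumes that a sample counts as an exact match only when every one of its rubric criteria passes. The `SampleResult` structure, the function names, and the toy verdicts are hypothetical and are not taken from the benchmark's actual implementation or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Rubric verdicts for one edited sample (hypothetical structure)."""
    sample_id: str
    verdicts: List[bool]  # one pass/fail verdict per rubric criterion


def criterion_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of all criteria, pooled over samples, that passed."""
    total = sum(len(r.verdicts) for r in results)
    passed = sum(sum(r.verdicts) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: List[SampleResult]) -> float:
    """Fraction of samples whose criteria ALL passed.

    Assumption: this is what "Exact Match Rate" means; the abstract
    does not spell out the definition.
    """
    if not results:
        return 0.0
    exact = sum(1 for r in results if r.verdicts and all(r.verdicts))
    return exact / len(results)


# Toy verdicts, invented purely for illustration.
results = [
    SampleResult("a", [True, True, True]),
    SampleResult("b", [True, False, True, True]),
    SampleResult("c", [False, True]),
]
print(f"criterion pass rate: {criterion_pass_rate(results):.3f}")
print(f"exact match rate:    {exact_match_rate(results):.3f}")
```

Under this assumed definition, a system can satisfy most individual criteria and still score a low EMR, because a single failed criterion disqualifies the whole sample. With 17,741 criteria over 2,000 samples (about 8.9 per sample on average), that all-or-nothing requirement is demanding.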