MMAE: A Massive Multitask Audio Editing Benchmark

Abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark and the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models such as Nano-banana 2 for images and Gemini-Omni for video, into audio. However, evaluation infrastructure lags far behind: it remains highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks of limited scope, MMAE covers a broad spectrum of real-world scenarios across 7 distinct audio modalities, including sound, speech, music, and their mixtures. We further establish a comprehensive taxonomy spanning 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this rubric-based paradigm enables precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models shows that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and drops to 0% on complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
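The abstract does not define how EMR is computed from the rubric. As a rough illustration only, the sketch below assumes that each sample carries a list of pass/fail verdicts (one per rubric criterion) and that a sample counts as an exact match only when every one of its criteria passes. The `SampleResult` structure, the function names, and the toy verdicts are hypothetical and are not taken from the benchmark. For scale, 17,741 criteria over 2,000 samples works out to roughly 8.9 criteria per sample on average, so an all-or-nothing metric of this kind would be much stricter than a pooled per-criterion pass rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Rubric verdicts for one edited sample (hypothetical structure)."""
    sample_id: str
    criteria_passed: List[bool]  # one pass/fail verdict per rubric criterion


def criterion_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of individual criteria that passed, pooled over all samples."""
    total = sum(len(r.criteria_passed) for r in results)
    passed = sum(sum(r.criteria_passed) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: List[SampleResult]) -> float:
    """Fraction of samples whose criteria ALL passed.

    This all-or-nothing definition is an assumption made for illustration;
    the abstract only reports EMR values, not the formula.
    """
    if not results:
        return 0.0
    exact = sum(1 for r in results if r.criteria_passed and all(r.criteria_passed))
    return exact / len(results)


# Toy verdicts, invented purely for demonstration.
results = [
    SampleResult("a", [True, True, True]),
    SampleResult("b", [True, False, True, True]),
    SampleResult("c", [False, True]),
    SampleResult("d", [True, True, False, True, True]),
]

print(f"criterion pass rate: {criterion_pass_rate(results):.3f}")  # 0.786
print(f"exact match rate:    {exact_match_rate(results):.3f}")     # 0.250
```

In this toy run, 11 of 14 criteria pass, yet only one of the four samples satisfies its whole rubric. Under the assumed definition, this gap shows how a system could meet most individual criteria and still score a very low EMR.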