MMAE: A Massive Multitask Audio Editing Benchmark

Abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
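
The abstract does not spell out how rubric verdicts are aggregated. As a minimal sketch, assume each sample carries a list of pass/fail verdicts (one per rubric criterion) and that EMR is the fraction of samples whose criteria are all satisfied. Both assumptions, and the names `SampleResult`, `criterion_pass_rate`, and `exact_match_rate`, are illustrative and not taken from the benchmark. For scale, 17,741 criteria over 2,000 samples is roughly 8.9 criteria per sample on average, so a single failed criterion is enough to break an exact match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Hypothetical container: rubric verdicts for one edited sample."""
    sample_id: str
    criteria_passed: List[bool]  # one pass/fail verdict per rubric criterion


def criterion_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of all rubric criteria satisfied, pooled across samples."""
    total = sum(len(r.criteria_passed) for r in results)
    passed = sum(sum(r.criteria_passed) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: List[SampleResult]) -> float:
    """Assumed EMR: fraction of samples in which every criterion passes."""
    if not results:
        return 0.0
    exact = sum(
        1 for r in results if r.criteria_passed and all(r.criteria_passed)
    )
    return exact / len(results)


# Toy data, invented purely for illustration.
results = [
    SampleResult("s1", [True, True, True]),
    SampleResult("s2", [True, False, True, True]),
    SampleResult("s3", [False, True]),
]

print(f"{criterion_pass_rate(results):.3f}")  # 0.778 (7 of 9 criteria)
print(f"{exact_match_rate(results):.3f}")     # 0.333 (1 of 3 samples)
```

Under this reading, a system can satisfy most individual criteria and still score a low EMR, which is consistent with the gap the abstract reports between partial success and fully reliable edits.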