MMAE: A Massive Multitask Audio Editing Benchmark

Abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark and the first comprehensive evaluation testbed for general-purpose, instruction-based audio editing. Driven by the shift toward intelligent creation, interactive editing has rapidly expanded from the visual domain, where it was pioneered by models such as Nano-banana 2 for images and Gemini-Omni for video, into audio. Evaluation infrastructure, however, lags far behind: it remains fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks with limited scope, MMAE covers a broad spectrum of real-world scenarios across 7 distinct audio modalities, including sound, speech, music, and their mixtures. We further establish a taxonomy spanning 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), 2 levels of granularity, and 8 distinct operation types. Curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this framework enables precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models shows that current systems remain far from reliable editing. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and drops to 0% on complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community by providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.