MAEB: Massive Audio Embedding Benchmark

Abstract

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models that excel at acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
