MAEB: Massive Audio Embedding Benchmark

Abstract

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. More generally, models that excel at acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks, and is designed to preserve task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks, along with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.
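The claim that encoder scores on the benchmark correlate with performance inside audio large language models can be illustrated with a rank correlation. The abstract does not say which coefficient is used, so the sketch below assumes Spearman's rho as one common choice. The encoder names and all scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical placeholder scores (assumption): one benchmark average and one
# downstream audio-LLM score per encoder. These are NOT numbers from the paper.
benchmark_scores = {"encoder_a": 0.41, "encoder_b": 0.55, "encoder_c": 0.48, "encoder_d": 0.62}
audio_llm_scores = {"encoder_a": 0.30, "encoder_b": 0.47, "encoder_c": 0.52, "encoder_d": 0.58}

names = sorted(benchmark_scores)
rho = spearman(
    [benchmark_scores[n] for n in names],
    [audio_llm_scores[n] for n in names],
)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is used here because it compares only the ordering of encoders, so it is unaffected by the two score scales being different. With real data, a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written helpers.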
