MAEB: Massive Audio Embedding Benchmark
February 17, 2026
Authors: Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
cs.AI
Abstract
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling at acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks, and is designed to maintain task diversity while reducing evaluation cost; it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.