MAEB: Massive Audio Embedding Benchmark

Abstract

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. More generally, models that excel at acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks; it is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks, along with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.
