AHELM: A Holistic Evaluation of Audio-Language Models
August 29, 2025
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
cs.AI
Abstract
Evaluations of audio-language models (ALMs) -- multimodal models that take
interleaved audio and text as input and output text -- are hindered by the lack
of standardized benchmarks; most benchmarks measure only one or two
capabilities and omit evaluative aspects such as fairness or safety.
Furthermore, comparison across models is difficult as separate evaluations test
a limited number of models and use different prompting methods and inference
parameters. To address these shortfalls, we introduce AHELM, a benchmark that
aggregates various datasets -- including 2 new synthetic audio-text datasets
called PARADE, which evaluates the ALMs on avoiding stereotypes, and
CoRe-Bench, which measures reasoning over conversational audio through
inferential multi-turn question answering -- to holistically measure the
performance of ALMs across 10 aspects we have identified as important to the
development and usage of ALMs: audio perception, knowledge, reasoning, emotion
detection, bias, fairness, multilinguality, robustness, toxicity, and safety.
We also standardize the prompts, inference parameters, and evaluation metrics
to ensure equitable comparisons across models. We test 14 open-weight and
closed-API ALMs from 3 developers and 3 additional simple baseline systems, each
consisting of an automatic speech recognizer and a language model. Our results
show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits
group unfairness (p=0.01) on ASR tasks whereas most of the other models do
not. We also find that the baseline systems perform reasonably well on AHELM,
with one ranking 5th overall despite having only speech-to-text capabilities.
For transparency, all raw prompts, model generations, and outputs are available
on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is
intended to be a living benchmark and new datasets and models will be added
over time.
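
The abstract describes simple baseline systems that pair an automatic speech recognizer with a text-only language model. The following is a minimal sketch of what such a cascaded pipeline can look like; it is an illustration under assumptions, not AHELM's actual baseline code. It assumes the openai-whisper and openai Python packages, and the model names and the asr_lm_baseline helper are illustrative placeholders.

import whisper
from openai import OpenAI

def asr_lm_baseline(audio_path: str, question: str) -> str:
    # Step 1: speech-to-text with an off-the-shelf ASR model.
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]

    # Step 2: answer with a text-only language model, conditioning on the
    # transcript rather than on the raw audio.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder text-only LM, not AHELM's choice
        messages=[{
            "role": "user",
            "content": f"Transcript of the audio:\n{transcript}\n\n{question}",
        }],
        temperature=0.0,  # deterministic decoding for reproducible evaluation
    )
    return response.choices[0].message.content

# Example usage (file path and question are illustrative):
# print(asr_lm_baseline("clip.wav", "What is the speaker's main point?"))

A cascade like this is a useful reference point: because it only sees a transcript, any gap between it and a native audio-language model reflects information carried by the audio beyond the words themselves, which is why it is notable that a speech-to-text-only baseline can still rank 5th overall on AHELM.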