AHELM: A Holistic Evaluation of Audio-Language Models

Abstract

Evaluation of audio-language models (ALMs), multimodal models that take interleaved audio and text as input and output text, is hindered by the lack of standardized benchmarks. Most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Comparison across models is also difficult, because separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets to holistically measure the performance of ALMs across 10 aspects we have identified as important to their development and usage: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. The datasets include two new synthetic audio-text datasets: PARADE, which evaluates ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers, plus 3 simple baseline systems, each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 of the 10 aspects, it exhibits group unfairness (p=0.01) on automatic speech recognition (ASR) tasks, whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark, and new datasets and models will be added over time.
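
The baseline systems mentioned in the abstract pair an automatic speech recognizer with a language model. The following is a minimal sketch of that kind of cascade, not the AHELM implementation: the names `make_cascaded_baseline`, `fake_asr`, and `fake_lm`, as well as the prompt layout, are hypothetical and chosen only for illustration.

```python
from typing import Callable

# Hypothetical type aliases; neither name comes from AHELM.
Transcriber = Callable[[bytes], str]  # audio bytes -> transcript
TextModel = Callable[[str], str]      # text prompt -> text answer


def make_cascaded_baseline(
    transcribe: Transcriber, generate: TextModel
) -> Callable[[bytes, str], str]:
    """Compose an ASR step and a text-only LM into one audio-in, text-out system."""

    def answer(audio: bytes, instruction: str) -> str:
        transcript = transcribe(audio)
        # The language model sees only the transcript, never the audio itself.
        # The prompt layout below is an assumption made for this sketch.
        prompt = f"Transcript: {transcript}\n{instruction}"
        return generate(prompt)

    return answer


# Toy stand-ins so the sketch runs without any real model.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_lm(prompt: str) -> str:
    return prompt.upper()


baseline = make_cascaded_baseline(fake_asr, fake_lm)
result = baseline(b"hello there", "Summarize.")
```

In a real cascade, `fake_asr` and `fake_lm` would be replaced by calls to an actual speech recognizer and language model. Because only text passes between the two stages, such a system has speech-to-text capability alone, which is the limitation the abstract notes for the baselines.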
